# COGS 108 - Final Project Proposal
## A Journey Across the Stars

# Names
- Zhenyi Chen
- Alison Camille Dunning
- Matthieu Dante Pardin
- Amogh Patankar

# Research Question

#### Of the characteristics temperature, brightness, luminosity, radius, color, and spectral class, which are the most statistically influential on a star’s type, and which statistical model is most accurate for determining this?
![Stars](imgs/stars_cropped.jpg)

<a id='background'></a>
## Background and Prior Work

For centuries, outer space has been considered one of the great unknowns of civilization, particularly the different objects that exist within it. Of those objects, stars are among the most researched cosmic bodies, and they tend to be classified in a specific manner. The classification of stars depends on a variety of factors that qualify a star to fall into a certain category. Several classification schemes exist; the categories of the most common one are called spectral classes or spectral types.

The classification of stars into these "classes" also corresponds to a diagram known as the Hertzsprung-Russell Diagram, which plots the absolute magnitudes of stars against their spectral types/classes. The spectral classes, in order of descending temperature, are O, B, A, F, G, K, M [1]. This means that an O-type star is the hottest and brightest, while an M-type star is the coolest and dimmest [2].



<a id='hr'></a>
**The Hertzsprung-Russell Diagram**



![HR](imgs/HRDiagram.jpg)

The Hertzsprung-Russell Diagram, shown above [3], is a graph that has evolved from its original creation by Danish astronomer Ejnar Hertzsprung and American astronomer Henry Norris Russell [2]. It plots the brightness of stars (as luminosity or absolute magnitude) against their temperature or spectral type.


In this encoding of stars with respect to their temperatures, cooler stars appear redder and more orange, while hotter stars appear whiter and bluer. As shown in the diagram above, the main categories of star naming are Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giants, and Hyper Giants. Brown dwarfs belong to a category called substellar objects and are less massive than white and red dwarfs; they never reach sustained nuclear fusion, i.e. they don't "ignite" [4]. White dwarfs are one of the end points of a star's life cycle, and they are low in luminosity. Red dwarfs are larger than brown and white dwarfs, making them the largest "dwarf" stars. Main sequence stars make up around 90% of the universe's stars; at the end of their life cycles, most of them leave the main sequence and become white dwarfs [4]. Finally, super giants and hyper giants are the biggest stars in the universe, with hyper giants tending to be larger than super giants. Neither type becomes a white dwarf; instead, they "die off" by exploding as supernovae.

Because the number of stars is effectively limitless and our universe keeps expanding, star classification is an open-ended task, which is the driving force behind our research question and hypothesis.

References:
- [1] https://astrobackyard.com/types-of-stars/
- [2] https://www.britannica.com/science/Hertzsprung-Russell-diagram
- [3] https://upload.wikimedia.org/wikipedia/commons/6/6b/HRDiagram.png
- [4] https://www.space.com/22437-main-sequence-stars.html

# Hypothesis
One challenge with the features we currently have is that they are likely highly correlated and depend on one another. Additionally, the true accuracy of some of the measurements may be beyond what we can currently verify. For this reason, we hypothesize that all variables will contribute roughly equally to the prediction, and that we will not need a complex model to estimate a star's type.

# Data

We use a rather simple dataset for our analysis, extracted from NASA's larger databases. It has only 240 observations and six features, but this gives us the opportunity to explore multiple models. Additionally, our target classes are evenly distributed, and the variables span several measurement types. Below is some meta-information about our dataset:

- **Name**: Star Type Classification ([Link to Dataset](https://www.kaggle.com/brsdincer/star-type-classification))
- **Description**: This dataset contains information about discovered stars, including their types ([Background and Prior Work](#background)) and other physical attributes.  The purpose of this dataset is to use statistics and modeling to prove that stars can be put into clusters, like those shown on a [Hertzsprung-Russell diagram](#hr), according to their temperature and luminosity.

## Variables
| Variable Name | Measurement Type | Description/Meaning |
|---|---|---|
| Temperature | Numerical (Ratio) | Temperature in kelvins |
| L | Numerical (Ratio) | Relative Luminosity ($L_o=3.828 \times 10^{26}$ W, the average luminosity of the Sun) |
| R | Numerical (Ratio) | Relative Radius ($R_o=6.9551 \times 10^{8}$ m, the average radius of the Sun) |
| AM | Numerical (Interval) | Absolute Magnitude |
| Color | Categorical (Nominal) | General color in the spectrum |
| Spectral Class | Categorical (Ordinal) | Stellar Classification of the star |
| Type | Categorical (Nominal) | Integer-encoded label for one of the following types: Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giants, Hyper Giants |
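
The measurement types in the table suggest different encodings before modeling. The sketch below is one possible approach, not a finalized preprocessing step: it maps the ordinal Spectral Class to a rank using the O, B, A, F, G, K, M temperature order from the background section, and one-hot encodes a nominal variable such as Color. The function names and the example color list are hypothetical and are not taken from the dataset.

```python
# Spectral classes ordered from hottest (O) to coolest (M).
SPECTRAL_ORDER = ["O", "B", "A", "F", "G", "K", "M"]


def encode_spectral_class(label):
    """Map a spectral class letter to its rank (0 = hottest, 6 = coolest)."""
    return SPECTRAL_ORDER.index(label.strip().upper())


def one_hot(label, categories):
    """One-hot encode a nominal label against a fixed list of categories."""
    return [1 if label == category else 0 for category in categories]


# Hypothetical color categories, for illustration only.
example_colors = ["Red", "Blue", "White"]

print(encode_spectral_class("G"))
print(one_hot("Blue", example_colors))
```

An ordinal rank preserves the temperature ordering of the classes, while one-hot encoding avoids imposing an artificial order on a nominal variable.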

### Variable Backgrounds
#### Temperature in Kelvins
The **Kelvin scale** is a temperature scale that is frequently used in astronomy. Astronomers need another scale so they can write numbers in a more convenient manner. The Fahrenheit and Celsius scales are set up so that the numbers are reasonable for Earth's temperatures, but temperatures in space are much hotter or much colder. Additionally, there are no negative numbers in the Kelvin scale, which makes it easy to write very cold temperatures: -391 degrees Fahrenheit translates to about 38 kelvins. To give some perspective on our dataset, the Sun's core is about 15 million kelvins (about 27 million degrees Fahrenheit).
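
The conversions quoted above can be checked with a short sketch. The function names are our own, chosen for illustration; the formulas are the standard Fahrenheit/Kelvin conversions.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvins."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


def kelvin_to_fahrenheit(temp_k):
    """Convert kelvins to degrees Fahrenheit."""
    return (temp_k - 273.15) * 9.0 / 5.0 + 32.0


# A very cold temperature: roughly 38 K.
print(round(fahrenheit_to_kelvin(-391), 2))

# The Sun's core: roughly 27 million degrees Fahrenheit.
print(f"{kelvin_to_fahrenheit(15e6):.3e}")
```
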
#### Luminosity
**Luminosity** is the amount of electromagnetic energy a star emits per unit of time, measured in joules per second, or watts. In astronomy, luminosity is typically expressed in terms of the average luminosity of the Sun, $L_o$. It differs from **brightness**, which is a measure of appearance that depends on the luminosity, the distance from the viewer, and the absorption of light along the path to the observer. Stellar luminosity depends on both the star's size (expressed in terms of $R_o$) and its temperature in kelvins. Most of the time, neither can be measured directly. To estimate a star's radius, we use its **angular diameter** and its distance from Earth (obtained from **stellar parallax**). The geometry of the angular diameter computation is shown below, where $\delta=2\arctan\left(\frac{d}{2D}\right)$, $d$ is the star's physical diameter, and $D$ is its distance from the observer.

![Angular diameter](imgs/Angular_dia_formula.JPG)
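
As a rough check of the angular diameter relation, the sketch below evaluates $\delta=2\arctan(d/2D)$ for the Sun as seen from Earth and inverts it to recover the physical diameter. The function names are hypothetical, and the Earth-Sun distance (one astronomical unit) is a value we assume here; it does not come from the dataset.

```python
import math


def angular_diameter(d, D):
    """Angular diameter in radians for physical diameter d at distance D (same units)."""
    return 2.0 * math.atan(d / (2.0 * D))


def physical_diameter(delta, D):
    """Invert the relation: physical diameter from angular diameter (radians) and distance."""
    return 2.0 * D * math.tan(delta / 2.0)


SUN_DIAMETER_M = 2 * 6.9551e8  # twice the solar radius R_o used in the Variables table
AU_M = 1.495978707e11          # assumed Earth-Sun distance in meters

delta = angular_diameter(SUN_DIAMETER_M, AU_M)
print(math.degrees(delta))  # roughly half a degree
```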

Alternatively, a star's luminosity is measured with its apparent brightness and distance.
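
Both routes to luminosity can be sketched numerically. The example below assumes a star radiates as an ideal black body (the Stefan-Boltzmann law, $L = 4\pi R^2 \sigma T^4$) and that light spreads evenly over a sphere with no absorption (the inverse-square law, $L = 4\pi D^2 F$). The solar effective temperature (5772 K) and the solar flux at Earth (about 1361 W/m²) are values we assume for the check; they are not in the dataset.

```python
import math

L_SUN_W = 3.828e26        # L_o from the Variables table
R_SUN_M = 6.9551e8        # R_o from the Variables table
SIGMA = 5.670374419e-8    # Stefan-Boltzmann constant, W m^-2 K^-4
AU_M = 1.495978707e11     # assumed Earth-Sun distance in meters


def luminosity_from_radius_temperature(radius_m, temp_k):
    """Black-body luminosity in watts from radius and surface temperature."""
    return 4.0 * math.pi * radius_m ** 2 * SIGMA * temp_k ** 4


def luminosity_from_flux(flux_w_m2, distance_m):
    """Luminosity in watts from apparent brightness (flux) and distance, ignoring absorption."""
    return 4.0 * math.pi * distance_m ** 2 * flux_w_m2


# Both estimates for the Sun, expressed in units of L_o (should be close to 1).
print(luminosity_from_radius_temperature(R_SUN_M, 5772.0) / L_SUN_W)
print(luminosity_from_flux(1361.0, AU_M) / L_SUN_W)
```

Because luminosity is, under these assumptions, a function of radius and temperature, the corresponding columns of the dataset are not independent of one another.
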
#### Absolute Magnitude
**Magnitude** is a measure of the brightness of a star. Stars are assigned to magnitude classes, with the lowest-numbered class being the brightest. The scale for magnitude is logarithmic: each step of one magnitude changes the brightness by a factor of the fifth root of 100 (about 2.512), so a magnitude 1 star is 100 times brighter than a magnitude 6 star. Magnitude is measured either absolutely or apparently. Apparent magnitude is the magnitude of a celestial object as it appears in the night sky. Absolute magnitude is the magnitude the object would have if it were placed 10 parsecs (1 parsec ~= 3.26 light-years ~= 30.9 trillion km) away from Earth. We use absolute magnitude in our dataset. For reference, the Sun has an apparent magnitude of about -26.7.
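
The magnitude relations above can be written out as a short sketch. It uses the standard distance modulus, $M = m - 5\log_{10}(d / 10\,\text{pc})$, and ignores absorption of light along the line of sight. The function names are our own, and the Sun's apparent magnitude of -26.74 is an assumed reference value rather than a value from the dataset.

```python
import math


def brightness_ratio(mag_faint, mag_bright):
    """How many times brighter the lower-magnitude object is."""
    return 100.0 ** ((mag_faint - mag_bright) / 5.0)


def absolute_magnitude(apparent_mag, distance_pc):
    """Magnitude the object would have at 10 parsecs (no absorption assumed)."""
    return apparent_mag - 5.0 * math.log10(distance_pc / 10.0)


AU_IN_PARSEC = 1.0 / 206264.806  # one astronomical unit expressed in parsecs

print(brightness_ratio(6.0, 1.0))                 # five magnitude steps
print(brightness_ratio(2.0, 1.0))                 # one magnitude step
print(absolute_magnitude(-26.74, AU_IN_PARSEC))   # the Sun moved out to 10 parsecs
```

Seen from 10 parsecs, the Sun would be an unremarkable star of roughly magnitude 4.8, which illustrates why absolute rather than apparent magnitude is the useful feature for comparing stars.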

# Ethics & Privacy

The dataset used in this project belongs to NASA and is readily available on Kaggle. It is publicly accessible and anyone may use the data for their own analysis, so we are permitted to use it in this project. Considering the credibility of NASA as a government agency, we believe that the data used in this project are transparent. The measured variables, such as temperature and luminosity, are objective, and since the data do not involve any individual, data privacy is not a significant concern. With that laid out, our intent is to produce an objective analysis that considers ethics and privacy from as many angles as possible.

Even if the data we are using raise no ethical concerns, we cannot rule out the possibility of bias. One possible source of bias is the accuracy of the equipment used by NASA. We understand that measuring these variables exactly is impossible, so the values in the dataset are likely rough estimates. Unfortunately, the dataset does not indicate how far off these values may be. If these data were collected strictly for experimental purposes, we trust that the measurements were recorded with the utmost precision.

# Team Expectations 

Our group’s expectation is to produce clear visualizations of the cleaned data and to reach a well-supported answer to our research question by statistically analyzing the data and graphs. We communicate mostly through a Discord group chat to brainstorm and plan the project. For now, this is the most efficient way for us to share ideas, since our members are in different time zones. If necessary, we will schedule a Zoom meeting to share our latest progress and discuss any undecided issues. Every team member whose name is listed above has read the COGS108 Team Policies. All of us agree to meet these expectations and are willing to dedicate ourselves to completing the project.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 04/16 | 4:15 pm | Read the previous project from last quarter and think about this course’s project expectations; brainstorm topics and search for a usable, interesting dataset | Choose a communication platform; discuss possible final project topics |
| 04/22 | 8:30 pm | Search for a valid dataset; draft the research question; background research; draft project proposal; construct scatterplot to visualize the data | Go through the selected dataset; discuss the research question; think about the hypothesis; agree on ethics, privacy, and team expectations |
| 04/30 | 8:30 pm | Import and wrangle data; EDA | Review and edit wrangling; EDA |
| 05/07 | 8:30 pm | Checkpoint: data submission review | Discuss data analysis plan; finalize data wrangling |
| 05/14 | 8:30 pm | Implement data analysis | Discuss, review, and edit data analysis; draft results/conclusion |
| 05/21 | 8:30 pm | Checkpoint: EDA | Finalize EDA |
| 05/28 | 8:30 pm | Refine the conclusion and keep working on the completeness of the project | Go through the draft of the final project and discuss the conclusion; fix the incomplete parts of the final project |
| 06/04 | 8:30 pm | Plan the final video; refine the final report | Discuss the final project and shoot the final video; review the checklist for the final project and implement improvements |
| 06/09 | 8:30 pm | Final report; final video production | Final check of the final report and the final video; turn them in; group project surveys |
