# DI 501 Assignment 1

### Due: November 22, Wednesday by 23:30
---

### Submission and Grading Principles

**Assignment Submission Guidelines:**

* Submit your assignments via the assignment module on [ODTUClass](https://odtuclass.metu.edu.tr).

* You will work on this file, and rename it as "*name_surname_a1.ipynb*" (e.g., "*volga_sezen_a1.ipynb*").

* Late submissions will be accepted until November 26, Sunday by 23:30, with a 10% penalty per day.

* **<font color=#C91515>This is an individual assignment; do not collaborate, and uphold academic integrity principles.</font>**

* Offer insightful commentary about your results. Data understanding relies on your reasoning, not just on numbers, tables, or graphics.

* Include your code in the designated code blocks and your commentary in markdown blocks.
<br>*You can, of course, use multiple code blocks.* **Make sure printouts and graphs are visible before submission.**

------------

### The aim of this assignment is to familiarize you with:

* Python and Jupyter notebooks,

* Data cleaning,

* Descriptive statistics interpretation,

* Visualization methods, and

* Statistical tests

------------

### Dataset Description

You will work with data from the "Indo-U.S. Coudé Feed Stellar Spectra Library" which is a catalogue of stars in the sky.

This dataset is a merger of two tables that include metadata and calculations derived from observations conducted at Kitt Peak, Arizona.

Each row represents a unique star, and you will analyze various attributes associated with them.

General characteristics:

| Variable Name | Explanation |
|---|---|
| index | Name of the star in main catalogues |
| SpType | Spectral class of the star [$^{1}$](#notes)|
| RA | Right ascension, or $\alpha$. ($^{\circ}$, eq = J2000) [$^{2}$](#notes)|
| DEC | Declination, or $\delta$. ($^{\circ}$, eq = J2000) |

Observational parameters:

| Variable Name | Explanation |
|---|---|
| B | Measured brightness of the B(lue) filter (mag) [$^{3}$](#notes)|
| V | Measured brightness of the V(isual) filter (mag) |
| RV | Radial velocity (along line of sight) of the star (km/s) [$^{4}$](#notes)|
| Teff | Estimated effective temperature of the star surface (Kelvin) |
| e_Teff | Window of uncertainty for surface temperature |
| log(g) | Log of estimated surface gravity (Dex, $10^x$, compared to the Sun) |
| e_log(g) | Window of uncertainty for surface gravity |
| [Fe/H] | Estimated metallicity (Dex, $10^x$, compared to the Sun) [$^{5}$](#notes)|
| e_[Fe/H] | Window of uncertainty for metallicity |

Other miscellaneous columns:

| Variable Name | Explanation |
|---|---|
| lum | Luminosity class extracted from SpType |
| misc | Any peculiarities about the spectra extracted from SpType |
| PHYSREF | References to the literature regarding estimation of the physical parameters |
| Names | Other names for the star in different catalogues for cross-referencing |

#### Notes:

1. Made up of three components: temperature class (e.g., B1), luminosity class (Roman numeral), and other reported oddities.

    > The first two are labeled according to the Morgan-Keenan-Kellman classification system. <br>
    > 
    > * The temperature class has two parts, a letter (main spectral type), and a digit (subtype).  
    >   * The spectral sequence follows this order: O, B, A, F, G, K, M, excluding newer types. <br> The Sun is classified as G2. Here 2 signifies that the Sun lies between G0 and G9, closer to G0. <br>
    >
    > 
    > * Luminosity class is shown by Roman numerals from I to VI; only class I takes an a or b suffix.

2. Equatorial astronomical coordinates. RA/DEC are analogous to the x/y directions, but represent coordinates on a sphere, <br> and thus coordinates wrap around.

3. Defined in the UBV photometric system as the following: <br>B filter has a mean wavelength response of 442 nm, while for V the mean response is at 540 nm.

4. i.e. the speed at which the star is moving toward or away from Earth. Estimated via the Doppler effect.

5. i.e. proportion of metals (elements heavier than helium) to hydrogen in the star's chemical makeup.
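
As a small illustration of the dex unit used for [Fe/H], the sketch below converts a few values to linear ratios relative to the Sun. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical [Fe/H] values in dex (not taken from the dataset).
feh_dex = np.array([-1.0, 0.0, 0.3])

# A value x in dex corresponds to a factor of 10**x relative to the Sun.
ratio_to_sun = 10.0 ** feh_dex

for x, r in zip(feh_dex, ratio_to_sun):
    print(f"[Fe/H] = {x:+.1f} dex -> {r:.2f} times the solar value")
```

So a star with [Fe/H] = -1 has roughly one tenth of the Sun's metal-to-hydrogen proportion, while 0 means solar composition.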

### 1) Environment setup
Import numpy, pandas, matplotlib, seaborn, statsmodels and scipy, along with other libraries you may utilize.

In [8]:
## Fill here ##

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import math
import random


### 2) Initial inspection (5p)

1. Load the data given with the correct separator.

1. Show how many observations and features are present.

1. Then show the last 6 rows.
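
The three steps can be sketched on a toy table as follows. The `;` separator, the column names, and the star labels are assumptions made for this illustration only; inspect the actual file to find its separator.

```python
import io
import pandas as pd

# Toy stand-in for the real file. The ";" separator and all values
# are assumptions for illustration, not the dataset's actual format.
raw = io.StringIO(
    "index;SpType;Teff\n"
    "star_1;G2V;5770\n"
    "star_2;B1Ia;24000\n"
    "star_3;K0III;4800\n"
    "star_4;A0V;9600\n"
    "star_5;F5V;6500\n"
    "star_6;M2I;3600\n"
    "star_7;O9V;33000\n"
    "star_8;G8III;5000\n"
)

toy = pd.read_csv(raw, sep=";")

# shape is a (rows, columns) tuple: observations and features.
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} features")

# tail(n) returns the last n rows.
last_six = toy.tail(6)
print(last_six)
```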

In [None]:
## Fill here ##

### 3) Data cleaning and augmentation (15p)

**Assume you are a data scientist working on stellar distributions. You care about all of the columns of this dataset, especially ones on physical characteristics and fields relating to research.**

1.  * Report how many NA values are present in each column, and the value counts of NA's per row (how many rows have 0, 1, 2, ... NA's).

    * Then show the **correlation** of missingness between columns via missingno's heatmap.

    * Devise a strategy to get rid of columns and rows with NA's. (*The order matters here!*) Define your thresholds and report them clearly.
    
    * After removing NA's, print the shape of the dataset.
<br>
<br>
2.  * Define a new variable called "spec" that only consists of the main spectral class. (First capital letter in SpType) 
    
    * Then, generate a variable called "sub" that only consists of the subclass. (First digit in SpType).
    
    * Combine them using string addition to create a final variable called "type".
<br>
<br>
3. Normalize column names to be lowercase AND to not contain any non-letter character.

4. Show that your solutions worked by calling the info method of the dataframe.

*Hint: You will benefit a lot from using [regular expressions](https://images.datacamp.com/image/upload/v1665049611/Marketing/Blog/Regular_Expressions_Cheat_Sheet.pdf) and [lambda functions](https://www.datacamp.com/tutorial/python-lambda) here.*
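
A minimal sketch of the building blocks, on a made-up frame: counting missing values per column and per row, extracting parts of a string with regular expressions, and renaming columns with a lambda. The toy values are assumptions, and the thresholds for dropping rows or columns are left for you to decide.

```python
import re

import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "SpType": ["G2V", "B1Ia", "K0III", None],
    "Teff": [5770.0, np.nan, 4800.0, np.nan],
    "e_[Fe/H]": [0.1, 0.2, np.nan, np.nan],
})

# Missing values per column, and how many rows have 0, 1, 2, ... NA's.
na_per_col = toy.isna().sum()
na_per_row = toy.isna().sum(axis=1).value_counts().sort_index()

# First capital letter and first digit of SpType via regular expressions.
toy["spec"] = toy["SpType"].str.extract(r"([A-Z])", expand=False)
toy["sub"] = toy["SpType"].str.extract(r"(\d)", expand=False)
toy["type"] = toy["spec"] + toy["sub"]

# Lowercase each name, then drop every character that is not a letter.
toy = toy.rename(columns=lambda c: re.sub(r"[^a-z]", "", c.lower()))

print(na_per_col)
print(na_per_row)
print(toy.columns.tolist())
```

Note that stripping characters can make two different names collide, so it is worth checking that the normalized column names are still unique.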

In [None]:
## Fill here ##

### 5) Interpreting descriptive statistics (15p)

Using only the descriptive statistics of each numeric column (no graphs yet), determine whether each has a left-skewed,<br>right-skewed, or symmetric distribution. *You don't have to give definite answers; explain which category seems closest.*
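
One common heuristic is to compare the mean with the median: a mean noticeably above the median hints at a right tail, and a mean below it hints at a left tail. The sketch below applies this to synthetic columns (assumed data, not the assignment's dataset).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns with known shapes, for illustration only.
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=2000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

# describe() gives count, mean, std, min, quartiles, and max per column.
summary = toy.describe().T

# Positive difference suggests a right tail, negative a left tail.
summary["mean_minus_median"] = summary["mean"] - summary["50%"]

print(summary[["mean", "50%", "min", "max", "mean_minus_median"]])
```

Comparing the distances from the median to the minimum and to the maximum gives a second, rougher hint about which tail is longer.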

In [None]:
## Fill here ##

> Fill here by double clicking me!

### 6) Visualizing distributions (15p)

1. Visualize the distributions of numeric variables and determine whether they are close to a Gaussian distribution or not.<br>Provide additional commentary for any peculiarities you spot. (*You can use subplots and loops to put multiple graphs inside the same figure.*)

1. For the categorical variables, use appropriate visualisations and comment on whether the sample is balanced or not.
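
The subplot-and-loop pattern can be sketched as below, using synthetic columns whose names (`a`, `b`, `c`, `cls`) are placeholders rather than columns of the dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic data; column names are placeholders for illustration.
toy = pd.DataFrame({
    "a": rng.normal(size=500),
    "b": rng.lognormal(size=500),
    "c": rng.uniform(size=500),
    "cls": rng.choice(["A", "B", "K"], size=500, p=[0.6, 0.3, 0.1]),
})

# One histogram per numeric column, all in the same figure.
num_cols = toy.select_dtypes("number").columns
fig, axes = plt.subplots(1, len(num_cols), figsize=(4 * len(num_cols), 3))
for ax, col in zip(axes, num_cols):
    ax.hist(toy[col].dropna(), bins=30)
    ax.set_title(col)
fig.tight_layout()

# Relative class frequencies show how balanced a categorical column is.
balance = toy["cls"].value_counts(normalize=True)
print(balance)
```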

In [None]:
## Fill here ##

> Fill here by double clicking me!

### 7) Quantile-Quantile plots and hypothesis testing (15p)

1.  * Select a variable that looks like it has a normal distribution, rid it of any NA values and sort it. Assign it to a new array.
    
    * Create an array called "s1" filled with random numbers sampled from the standard normal distribution. $(\mu = 0,\ \ \sigma = 1)$ 
    <br>Then sort this array in ascending order. Make sure it has the same length as the array above.
    
    * Assemble a scatter plot with s1 on the x axis and your variable without NA's on the y axis. Then draw on top (or to the side) a qqplot of your variable using statsmodels.

<br>

2.  * This time, select a variable that has a right skewed distribution, and repeat the operations above.

    * Next, create a similar array called "s2", but this time sample from the log-normal distribution. $(\mu = 0,\ \ \sigma = 1)$

    * Assemble another scatterplot using seaborn's `regplot` function. Set `robust=True` and `ci=None` for the regression fit to be <br> robust to outliers coming from random sampling. Also draw another statsmodels qqplot next to it, or on top.

    Do the plots you generate resemble statsmodels' qqplots?<br><br>**Explain how the similarities came about, as well as any differences.**

3.  * Then apply the Shapiro-Wilk normality test to the two variables you chose.

    * For each variable, state the null hypothesis, report the test statistic and p-value, give the level of significance you chose, and state your final judgement.
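
The idea behind the manual plot is that pairing two sorted samples of equal length matches their empirical quantiles. The sketch below shows this pairing and a Shapiro-Wilk call on synthetic data; the stand-in values and the 0.05 significance level are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-in for a roughly normal column with missing values.
values = rng.normal(loc=5000.0, scale=800.0, size=300)
values[::50] = np.nan

# Drop NA's and sort the sample.
y = np.sort(values[~np.isnan(values)])

# Standard normal draws of the same length, sorted in ascending order.
s1 = np.sort(rng.standard_normal(size=y.size))

# The i-th smallest of s1 is paired with the i-th smallest of y, so a
# near-linear relationship suggests y is close to normal.
r = np.corrcoef(s1, y)[0, 1]

# Shapiro-Wilk: the null hypothesis is that the sample is normal.
w_stat, p_value = stats.shapiro(y)
alpha = 0.05  # assumed significance level
print(f"r = {r:.3f}, W = {w_stat:.4f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Plotting `s1` against `y` as a scatter plot gives the hand-made Q-Q plot; because `s1` is itself a random sample, the points wobble more than in a plot built from theoretical quantiles.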

In [None]:
## Fill here ##

> Fill here by double clicking me!

### 8) Correlations (15p)

Generate a pairwise Pearson correlation matrix of the numeric columns, visualise it with the correlation coefficients visible, and comment on any potentially interesting relationships.
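
A minimal sketch of computing the matrix on synthetic data, where the column names and the relationship between `x` and `y` are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
x = rng.normal(size=200)

# Synthetic frame: y depends on x, z is independent, label is non-numeric.
toy = pd.DataFrame({
    "x": x,
    "y": 2.0 * x + rng.normal(scale=0.5, size=200),
    "z": rng.normal(size=200),
    "label": ["a", "b"] * 100,
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes("number").corr(method="pearson")
print(corr.round(2))
```

The resulting matrix can be passed to seaborn's `heatmap` with `annot=True` to display the coefficients on the plot.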

In [None]:
## Fill here ##

> Fill here by double clicking me!

### 9) Testing the spectral sequence (20p)

1. * Use graphs (or `groupby`) to see which of the physical characteristics can explain the star's class. (The column you created named 'spec'.)
        <br>In other words, select which features you would use in order to separate the classes easily by straight lines, and plot the distributions of those features for each class.
<br>
<br><br>

2. * Then test the claim that the mean log(g) of A class stars is less than the mean log(g) of B class stars using a paired t-test.
    <br>*Note that there may not be the same number of A class and B class stars. Sample the bigger class down to the size of the smaller one.*
    
   * As usual, state your null hypothesis, calculate the test statistic, choose a significance level, find a p-value using the stats package, and finally give your judgement on the hypothesis.
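
The down-sampling and the one-sided paired test can be sketched as follows. The two samples are synthetic, and the class means, sizes, and the 0.05 significance level are assumptions, so the outcome says nothing about the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Made-up log(g) samples for two classes of unequal size.
logg_a = rng.normal(loc=4.0, scale=0.3, size=80)
logg_b = rng.normal(loc=3.8, scale=0.4, size=50)

# Sample both down to the size of the smaller class, without replacement.
n = min(logg_a.size, logg_b.size)
a_sub = rng.choice(logg_a, size=n, replace=False)
b_sub = rng.choice(logg_b, size=n, replace=False)

# H0: mean(A) >= mean(B); H1: mean(A) < mean(B).
# alternative="less" tests whether the mean of (a_sub - b_sub) is below 0.
t_stat, p_value = stats.ttest_rel(a_sub, b_sub, alternative="less")

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Because the pairing here comes from random sampling rather than from matched observations, a different random seed will pair the stars differently and change the result somewhat.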

In [None]:
## Fill here ##

> Fill here by double clicking me!