<center> <H1> Data Preprocessing and Feature Selection </H1> </center>

In [None]:
import numpy as np
import pandas as pd
%pylab inline

<center> <H2> Features in the dataset </H2> </center>

| Key | Description |
|----:|:------------|
|[0] amplitude (B) |  Amplitude from the Fourier decomposition  |
|[1] cusum (B) |  Cumulative sum index  |
|[2] hl_amp_ratio (B) |  Ratio of magnitudes higher than the average to those lower than the average | 
|[3] kurtosis (B) |  Kurtosis   |
|[4] period (P-V) |  Period   |
|[5] period_SNR (P-V) | SNR of period derived using a periodogram |
|[6] period_uncertainty (V) | Period uncertainty based on a periodogram |
|[7] phase_cusum (V) |  Cumulative sum index over a phase-folded light curve  |
|[8] phase_eta (V) |  Eta over a phase-folded light curve   |
|[9] phi21 (V) |  2nd and 1st phase difference from the Fourier decomposition   |
|[10] phi31 (V) |  3rd and 1st phase difference from the Fourier decomposition    |
|[11] quartile31 (B) |  3rd quartile - 1st quartile   |
|[12] r21 (P-V) |  2nd to 1st amplitude ratio from the Fourier decomposition   |
|[13] r31 (P-V) |  3rd to 1st amplitude ratio from the Fourier decomposition   |
|[14] shapiro_w (V) |  Shapiro-Wilk test statistic  |
|[15] skewness (B,V) |  Skewness   |
|[16] slope_per10 (B) |  10th percentile of slopes of a phase-folded light curve   |
|[17] slope_per90 (B) |  90th percentile of slopes of a phase-folded light curve   |
|[18] stetson_k (V) |  Stetson K  |
|[19] weighted_mean (B) | Weighted mean magnitude |
|[20] weighted_std (B) | Weighted standard deviation of magnitudes |

## Classes in the dataset

#### Class 1 : Planets

<img src="Planet.jpg" height="200" width="200">

#### Class 2 : RR Lyrae (Periodically variable star)

<img src="RRLyrae.jpg" height="200" width="200">

#### Class 3 : Supernovae

<img src="Supernova.jpg" height="200" width="200">


## Load your data into pandas

In [None]:
df = pd.read_csv("dirty_data.csv")

## Have a quick look at your data using head and tail

In [None]:
df.head()

In [None]:
df.tail()

## A quick overview of the data using describe

In [None]:
df.describe()

## Is anything out of place?

### Fix -999

In [None]:
df[df["phase_cusum"]==-999]

In [None]:
df["phase_cusum"].iloc[104]

In [None]:
df.at[104, "phase_cusum"] = np.nan

### And the other occurrence

In [None]:
df[df["slope_per90"]==-999]

In [None]:
df.at[81, "slope_per90"] = np.nan
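
Fixing each -999 by hand works for two cells, but it does not scale. As a minimal sketch on a made-up frame (`toy`, `sentinel_counts` and `cleaned` are illustrative names, not part of the dataset), the sentinel can be counted and replaced across every column in one go. This assumes -999 is never a legitimate measurement in any column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; -999 marks a missing value.
toy = pd.DataFrame({
    "phase_cusum": [0.21, -999.0, 0.35],
    "slope_per90": [1.4, 2.2, -999.0],
})

# Count the sentinel in each column before changing anything.
sentinel_counts = (toy == -999).sum()

# Replace every -999 in the frame with NaN in a single call.
cleaned = toy.replace(-999, np.nan)
```

Checking `sentinel_counts` first shows which columns are affected before anything is overwritten.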

## Let's deal with that infinity we found

In [None]:
df[df["skewness"]==np.inf]

In [None]:
df.at[579, "skewness"] = np.nan
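
The comparison above only catches positive infinity. A small sketch on invented values (the `toy` frame is hypothetical) shows one way to catch both signs and turn them into NaN. Note that `np.isinf` needs a numeric column, so it would not work on a column that still holds strings.

```python
import numpy as np
import pandas as pd

# Toy column with both +inf and -inf.
toy = pd.DataFrame({"skewness": [0.3, np.inf, -1.2, -np.inf]})

# np.isinf flags both signs of infinity.
inf_rows = toy[np.isinf(toy["skewness"])]

# Replace both signs of infinity with NaN.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
```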

In [None]:
df.describe()

## Let's have a look at the data types

In [None]:
df.dtypes

In [None]:
len(df.dtypes)

In [None]:
df["cusum"].min()

## Do you notice something subtle above?

In [None]:
df["cusum"].sum()

## Looks like it thinks these are strings. Maybe there is a string in here?

In [None]:
df["cusum"].iloc[0]

In [None]:
for i in range(len(df)):
    float(df["cusum"].iloc[i])

## Let's search for 'hello'

In [None]:
df[df["cusum"]=='hello']
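
Searching for 'hello' only works once you already know what the bad value is. One possible way to find offending entries without knowing them in advance, sketched here on a made-up series, is to let `pd.to_numeric` try to parse the column and then look at what failed.

```python
import pandas as pd

# Toy column read as strings because one entry is not a number.
cusum = pd.Series(["0.12", "0.40", "hello", "0.07"], name="cusum")

# Values that fail to parse become NaN under errors="coerce".
parsed = pd.to_numeric(cusum, errors="coerce")

# Entries that were present originally but did not parse are the offenders.
bad = cusum[parsed.isna() & cusum.notna()]
```

Here `bad` keeps the original index labels, so they can be used directly with `df.at`.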

## Now replace it with NaN

In [None]:
df.at[253, "cusum"] = np.nan

In [None]:
df.describe()

## The 'cusum' column still hasn't appeared because its dtype is still object. Convert it to numeric.

In [None]:
df["cusum"] = pd.to_numeric(df["cusum"])

In [None]:
df.describe()

In [None]:
df.dtypes

## Another good way to look for strings

In [None]:
df.sum()

## The easy way to get rid of strings. This makes them go to NaN automagically.

In [None]:
df = df.apply(pd.to_numeric, errors="coerce")
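
Coercion is convenient, but it is silent: anything unparseable simply becomes NaN. A hedged sketch on a toy frame (all names here are illustrative) compares the missing-value counts before and after, so you can see how many values each column lost.

```python
import pandas as pd

toy = pd.DataFrame({
    "cusum": ["0.12", "hello", "0.07"],
    "kurtosis": [1.5, None, 2.0],   # already has one genuine missing value
})

missing_before = toy.isna().sum()
numeric = toy.apply(pd.to_numeric, errors="coerce")
missing_after = numeric.isna().sum()

# Per-column count of values that the coercion turned into NaN.
newly_coerced = missing_after - missing_before
```

A large count in `newly_coerced` would suggest a formatting problem in that column rather than a stray string.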

In [None]:
df.describe()

## Now we have all of our columns and they are of type float

In [None]:
df.dtypes

## Let's start dealing with these NaNs (null)

In [None]:
df.isnull().any()

## The most important to deal with is the 'class'

In [None]:
df[df["class"].isnull()]

In [None]:
df = df.drop(df.index[[36, 614]])
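
The cell above relies on reading the row positions off the previous output. As an alternative sketch on invented data, `dropna` with `subset` removes every row with a missing label without hard-coding positions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "period": [0.5, 1.2, 0.8, 2.1],
    "class": [1.0, np.nan, 2.0, np.nan],
})

# Drop only the rows whose label is missing; NaNs in other columns are left alone.
labelled = toy.dropna(subset=["class"])
```

The index is not renumbered after the drop, so label-based access such as `df.at` still refers to the original row labels.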

In [None]:
df.isnull().any()

In [None]:
df.isnull().sum().sum()

## We still have 12 more instances to deal with. You can see them below.

In [None]:
df[df.isnull().any(axis=1)]

## An amazing trick below. What does it do?

In [None]:
df[df.columns[:-1]] = df.groupby("class").transform(lambda x: x.fillna(x.mean()))

In [None]:
df.isnull().sum().sum()
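
To see what the trick does, here is a minimal sketch on a tiny made-up frame (`toy` and `feature_cols` are illustrative names). Each missing value is filled with the mean of its column, computed only over rows of the same class.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "period": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
    "class":  [1,   1,      1,   2,    2,      2],
})

feature_cols = ["period"]

# For each class, replace NaN with that class's own column mean.
toy[feature_cols] = (
    toy.groupby("class")[feature_cols]
       .transform(lambda col: col.fillna(col.mean()))
)
```

The gap in class 1 is filled with 2.0 and the gap in class 2 with 20.0, whereas a single global mean would have put the same value in both.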

## All fixed?

In [None]:
df.describe()

# Phew!

<center> <H1> Data Normalization </H1> </center>

$$x := \frac{x-\bar{x}}{\sigma(x)}$$

In [None]:
df[df.columns[:-1]] = (df[df.columns[:-1]]-df[df.columns[:-1]].mean())/df[df.columns[:-1]].std()

In [None]:
df.describe()
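
A quick sanity check of the formula on a toy column (names are illustrative): after normalization the mean should be about 0 and the standard deviation about 1. Note that pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, so results can differ slightly from tools that use `ddof=0`.

```python
import pandas as pd

toy = pd.DataFrame({"period": [1.0, 2.0, 3.0, 4.0], "class": [1, 1, 2, 2]})
feature_cols = toy.columns[:-1]   # every column except the trailing label

# Subtract the column mean and divide by the column's sample standard deviation.
toy[feature_cols] = (
    (toy[feature_cols] - toy[feature_cols].mean()) / toy[feature_cols].std()
)
```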

<center> <H1> Feature Selection </H1> </center>

In [None]:
pd.plotting.scatter_matrix(df[df.columns[:-1]], c=df["class"], alpha=0.6, diagonal="kde", figsize=(15,15));

In [None]:
features = [1, 2, 3]
pd.plotting.scatter_matrix(df[df.columns[features]], c=df["class"], alpha=0.6, diagonal="kde", figsize=(15,15));

<center> <H1> Saving your CLEAN data </H1> </center>

In [None]:
chosen_features = [5, 9, 19]
features_with_class = chosen_features + [21,]
df[df.columns[features_with_class]].to_csv("clean_data.csv", index=False)
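
Positional indices are easy to get wrong if the column order ever changes. Assuming the columns follow the order of the feature table at the top, positions 5, 9 and 19 would correspond to `period_SNR`, `phi21` and `weighted_mean`. The sketch below selects by name on a made-up frame and writes to an in-memory buffer instead of a file, then reads it back as a check.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "period_SNR": [10.0, 12.5],
    "phi21": [0.3, 0.7],
    "weighted_mean": [15.2, 14.8],
    "kurtosis": [1.1, 0.9],
    "class": [1.0, 2.0],
})

chosen = ["period_SNR", "phi21", "weighted_mean"]
buffer = io.StringIO()          # in-memory stand-in for clean_data.csv
toy[chosen + ["class"]].to_csv(buffer, index=False)

# Read the saved data back to confirm only the chosen columns were written.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```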