# Project Template: Phase 1

Below are some concrete steps that you can take while doing your analysis. This guide isn't "one size fits all", so you will probably not do everything listed, but it still serves as a good "pipeline" for how to do data analysis.

If you do engage in a step, you should clearly mention it in the notebook.

---


In [1]:
import pandas as pd
import numpy as np
import sklearn
import json
import warnings
warnings.filterwarnings('ignore')

## Loading Data

In the cells below, make sure to do the following:

1. Load your dataset. If your dataset contains multiple files (e.g. AirBnB), make sure to merge them.
2. Decide what attribute you want to predict (you can change your mind later during EDA if needed).

### Types of Attributes

Below are some examples of types of attributes you may encounter. ML algorithms cannot use some types of data directly, so we have to encode them somehow. We have summarized ways to deal with these non-traditional data types. There are more examples in the Follow content for this week.

Make a note of which types of data your dataset has.

Traditional data types (individual numbers/values): no transformation needed

  * Nominal
    * Binary
  * Ordinal
  * Interval
  * Ratio
  * Continuous
  * Discrete

Non-traditional data types:
 
* **Text**
    * Encode with: [Bag of Words](https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation), [TF-IDF](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting), Embeddings
* **Sets** (e.g. tags {"Blog", "Video", "Finance Article"})
    * We should not treat these like a bag of words, since tags can be multi-word.
    * We can use [one hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
* **Time series data**: A series of numbers, e.g. predict the stock price next year from the last N years of prices.
    * Naive approaches: We can use each of the following as a separate feature:
        * Last value: Use the last value in the series.
        * Average, Median: Use the average or median of the values.
        * Max/min: Use the max and min of the values.
    * A more effective approach is to use an ML model that can take time-series data as input, such as a [Long Short-Term Memory](https://en.wikipedia.org/wiki/Long_short-term_memory) model, but these are out of scope for this course.
* **Numeric Data** that isn't directly interpretable (e.g. geospatial data)
    * This varies from situation to situation. Sometimes your data is numeric but isn't directly predictive of your class label (e.g. latitude and longitude; movie title). However, you may be able to combine this with other datasets to construct more meaningful features (e.g. State, Movie Genre).
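
The set and time-series encodings listed above can be sketched as follows. This is a minimal illustration on made-up data, not part of the project dataset: the `toy` frame and its `tags` and `prices` columns are hypothetical. It also uses scikit-learn's `MultiLabelBinarizer` rather than the `OneHotEncoder` linked above, on the assumption that each row holds several tags at once (`OneHotEncoder` expects a single category per cell).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: a set of tags and a fixed-length price history per row.
toy = pd.DataFrame({
    "tags": [{"Blog", "Finance Article"}, {"Video"}, {"Blog", "Video"}],
    "prices": [[10.0, 12.0, 11.0], [5.0, 4.0, 6.0], [7.0, 7.0, 9.0]],
})

# Sets: one indicator column per distinct tag, so multi-word tags stay intact.
mlb = MultiLabelBinarizer()
tag_features = pd.DataFrame(
    mlb.fit_transform(toy["tags"]), columns=mlb.classes_, index=toy.index
)

# Time series: naive summary statistics, each used as a separate feature.
prices = np.array(toy["prices"].tolist())  # shape (n_rows, n_steps)
series_features = pd.DataFrame({
    "last": prices[:, -1],
    "mean": prices.mean(axis=1),
    "median": np.median(prices, axis=1),
    "max": prices.max(axis=1),
    "min": prices.min(axis=1),
}, index=toy.index)

features = pd.concat([tag_features, series_features], axis=1)
print(features)
```

The summary-statistic approach assumes every row has the same number of time steps; ragged histories would need to be summarized row by row instead.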

In [2]:
df = pd.read_csv("./train.csv")

In [3]:
df.shape

(6593274, 9)

The attribute I will predict is: **target**

## Exploratory Data Analysis (EDA)

Using some of the techniques in the "Follow" document, explore your dataset. Then answer the following questions (you don't have to solve the problems yet - just be aware of them):

1. What attribute are you predicting and what type of supervised learning is this?
    * Binary Classification: Just 2 class labels
    * Multi-class classification: More than 2 class labels
    * Regression: A continuous variable
    * Ordinal classification: Predicting an ordinal value, e.g. a rating on a 5-star scale 
        * This is tricky! Do you want to change this into regression or binarize your variable to make this binary classification?

2. Do you need to perform feature selection?
    * E.g. do you have highly correlated features?

3. Do you have any non-traditional attributes (see the list above)? If so how will you encode them? (You don't have to do it yet.)

4. If you are doing classification, are your class labels balanced (similar numbers of instances from each class)?

5. If you are doing regression, how is your dependent variable distributed (e.g. normally, skewed)?
    
6. Do any of your features need transformation (e.g. because they have a skewed distribution)?
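
A few of the questions above (2, 4, 5 and 6) can be checked with short pandas calls. The sketch below runs on a hypothetical `toy` frame whose column names mirror this notebook's data but whose values are made up; on the real data you would substitute your own frame and target column.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "value": [19.2, 19.8, 19.5, 18.5, 95.0, 17.7],
    "temp10": [20.0, 19.84, 19.6, 18.9, 19.0, 18.1],
    "target": [False, False, False, False, True, False],
})

# Question 4: class balance, as the fraction of rows in each class.
class_balance = toy["target"].value_counts(normalize=True)

# Question 2: pairwise Pearson correlation between numeric features.
numeric = toy.select_dtypes(include="number")  # excludes the boolean target
correlations = numeric.corr()

# Questions 5 and 6: skewness per numeric column; values far from 0
# suggest a skewed distribution that may benefit from a transformation.
skewness = numeric.skew()

print(class_balance, correlations, skewness, sep="\n\n")
```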


In [4]:
df.head()

Unnamed: 0,Station,Ob,value,measure,target,R_flag,I_flag,Z_flag,B_flag
0,AURO,1/2/2021 0:30,19.2,temp_wxt,False,2,-1,0,1
1,AURO,1/2/2021 4:30,19.8,temp_wxt,False,2,-1,0,1
2,AURO,1/2/2021 5:30,19.5,temp_wxt,False,2,-1,0,1
3,AURO,1/2/2021 7:30,18.5,temp_wxt,False,2,-1,0,1
4,AURO,2/16/2021 2:30,17.7,temp_wxt,False,2,-1,0,1


In [5]:
UniqueStations = df.Station.unique()
# One entry per station: its rows from train.csv, and its full 2021 station file.
DataFrameDict = {}
stationdfDict = {}

for key in UniqueStations:
    # A boolean mask selects this station's rows from the training data.
    DataFrameDict[key] = df[df.Station == key]
    # Each station file is read from ./full/<station>_2021.csv.
    stationdfDict[key] = pd.read_csv("./full/" + key + "_2021.csv")

DataFrameDict['AURO'].head()

Unnamed: 0,Station,Ob,value,measure,target,R_flag,I_flag,Z_flag,B_flag
0,AURO,1/2/2021 0:30,19.2,temp_wxt,False,2,-1,0,1
1,AURO,1/2/2021 4:30,19.8,temp_wxt,False,2,-1,0,1
2,AURO,1/2/2021 5:30,19.5,temp_wxt,False,2,-1,0,1
3,AURO,1/2/2021 7:30,18.5,temp_wxt,False,2,-1,0,1
4,AURO,2/16/2021 2:30,17.7,temp_wxt,False,2,-1,0,1


In [6]:
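for key in DataFrameDict.keys():
    #temp = pd.concat([DataFrameDict[key], stationdfDict[key]], axis=1, keys="Ob", join="inner")
    # Inner join on the observation timestamp; columns present in both frames get _x/_y suffixes.
    temp = DataFrameDict[key].merge(stationdfDict[key], on="Ob", how='inner')
    stationdfDict[key] = 0  # drop the reference to the station file to free memory
    DataFrameDict[key] = temp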

DataFrameDict['AURO'].head()

Unnamed: 0.1,Station_x,Ob,value,measure,target,R_flag,I_flag,Z_flag,B_flag,Unnamed: 0,...,sm,temp10,ws02,wd02,gust02,ws06,wd06,gust06,leafwetness,blackglobetemp
0,AURO,1/2/2021 0:30,19.2,temp_wxt,False,2,-1,0,1,1470,...,0.516,20.0,2.9,182.0,3.3,3.484,187.1,6.096,389.0,19.1
1,AURO,1/2/2021 0:30,0.516,sm,False,0,-1,-1,1,1470,...,0.516,20.0,2.9,182.0,3.3,3.484,187.1,6.096,389.0,19.1
2,AURO,1/2/2021 4:30,19.8,temp_wxt,False,2,-1,0,1,1710,...,0.516,19.84,0.8,198.0,1.2,1.64,214.0,2.626,332.8,19.46
3,AURO,1/2/2021 4:30,0.516,sm,False,0,-1,-1,0,1710,...,0.516,19.84,0.8,198.0,1.2,1.64,214.0,2.626,332.8,19.46
4,AURO,1/2/2021 4:30,19.84,temp10,False,0,-1,-1,0,1710,...,0.516,19.84,0.8,198.0,1.2,1.64,214.0,2.626,332.8,19.46
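
The `Station_x` column in the output above comes from how `merge` handles column names that exist in both frames. The sketch below reproduces that on two hypothetical miniature tables (`flags` and `readings` are made-up names and values). It also passes `validate="many_to_one"`, which is not in the cell above; it is added here as an optional guard that raises an error if a timestamp appears more than once in the right-hand table.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables being merged.
flags = pd.DataFrame({
    "Station": ["AURO", "AURO", "AURO"],
    "Ob": ["1/2/2021 0:30", "1/2/2021 0:30", "1/2/2021 4:30"],
    "measure": ["temp_wxt", "sm", "temp_wxt"],
    "target": [False, False, False],
})
readings = pd.DataFrame({
    "Station": ["AURO", "AURO"],
    "Ob": ["1/2/2021 0:30", "1/2/2021 4:30"],
    "sm": [0.516, 0.516],
})

# "Station" exists in both frames and is not a join key, so pandas keeps
# both copies and renames them Station_x (left) and Station_y (right).
merged = flags.merge(readings, on="Ob", how="inner", validate="many_to_one")
print(merged.columns.tolist())
print(len(merged))
```

Because several rows on the left share one timestamp, the row count of the result matches the left table, which is consistent with the merged shape keeping the original number of rows.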


In [7]:
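# Stack all per-station frames into one DataFrame, keeping AURO first.
# A single pd.concat call avoids re-copying the growing frame on every iteration.
frames = [DataFrameDict['AURO']] + [DataFrameDict[key] for key in DataFrameDict if key != 'AURO']
full = pd.concat(frames)

full.shape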

(6593274, 34)

In [8]:
#compression_opts = dict(method='zip', archive_name='full.csv')  
#full.to_csv('full.zip', index=False, compression=compression_opts)  
full.to_csv('full.csv', index=False, header=True)

In [9]:
DataFrameDict = 0  # release the per-station frames now that `full` has been saved

Answer the questions below:

1. 
2. 
3. 
4. 
5. 
6. 

## Preprocessing

Do the following steps on your data (and any others you think are needed). See the "Follow" examples, as well as the original Week 2 materials for more on how to do each step.

1. **Feature Cleaning**: Remove meaningless features (e.g. IDs), or unfair features that make the problem too easy (e.g. percent grade should be removed if predicting final letter grade).
2. **Feature Discretization**: Discretize any attributes that should be discretized.
3. **Feature Transformation**: Encode non-standard features into usable formats (standardize dates, vectorize words). Transform any features (e.g. using a log-transform) as needed.
4. **Feature Selection**: Remove redundant, noisy, or unhelpful features.
5. **Aggregation**: If your data has multiple rows per class label, transform it so that there is only one row per class label.

Now revisit EDA as needed to see what your transformed dataset looks like.

If you don't need to do a given step, just skip it and explain why.

### 1) Feature cleaning
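
One possible sketch of this step, on a hypothetical `toy` frame shaped like the merged data: it drops leftover index columns (names starting with `Unnamed`) and the duplicate station column. Treating `Station_y` as redundant is an assumption, based on each station file being merged only with rows from the same station.

```python
import pandas as pd

# Hypothetical toy frame with the kinds of columns the merge produced.
toy = pd.DataFrame({
    "Unnamed: 0": [1470, 1710],
    "Station_x": ["AURO", "AURO"],
    "Ob": ["1/2/2021 0:30", "1/2/2021 4:30"],
    "value": [19.2, 19.8],
    "Station_y": ["AURO", "AURO"],
    "target": [False, False],
})

# Saved row indices carry no predictive meaning, so treat them like IDs.
id_like = [col for col in toy.columns if col.startswith("Unnamed")]

cleaned = (
    toy.drop(columns=id_like + ["Station_y"])
       .rename(columns={"Station_x": "Station"})
)
print(cleaned.columns.tolist())
```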

### 2) Feature discretization
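
If a numeric attribute should be discretized, `pd.cut` is one way to do it. The sketch below is illustrative only: the bin edges and labels are hypothetical, and the values are made up rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature-like readings.
temps = pd.Series([-3.0, 12.5, 19.2, 34.0], name="value")

# Made-up thresholds; with the default right=True each bin includes its
# upper edge, e.g. (0, 15] for "cool".
bins = [-np.inf, 0, 15, 30, np.inf]
labels = ["freezing", "cool", "mild", "hot"]

temp_band = pd.cut(temps, bins=bins, labels=labels)
print(temp_band)
```

The result is an ordered categorical, so the bands keep their low-to-high ordering if you later encode them as an ordinal feature.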

### 3) Feature transformation
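
A minimal sketch of two transformations mentioned in the step list: standardizing dates and applying a log-transform. It assumes the `Ob` strings follow a month/day/year hour:minute layout, as the sample rows suggest. Using `leafwetness` as the skewed column is purely hypothetical; check the distribution in EDA before transforming anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; timestamps follow the format seen in the Ob column.
toy = pd.DataFrame({
    "Ob": ["1/2/2021 0:30", "2/16/2021 2:30"],
    "leafwetness": [0.0, 389.0],
})

# Parse the timestamp strings, then derive numeric features from them.
ob = pd.to_datetime(toy["Ob"], format="%m/%d/%Y %H:%M")
toy["month"] = ob.dt.month
toy["hour"] = ob.dt.hour

# log1p computes log(1 + x), so zero values stay defined.
toy["leafwetness_log"] = np.log1p(toy["leafwetness"])
print(toy)
```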


### 4) Feature selection
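
One common way to remove redundant features is to drop one column from each highly correlated pair. The sketch below uses a hypothetical threshold of 0.95 and made-up values; the column names are borrowed from the merged data only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: two near-duplicate temperatures and a wind speed.
X = pd.DataFrame({
    "temp10": [20.0, 19.84, 18.0, 15.0, 22.0],
    "blackglobetemp": [19.1, 19.46, 17.5, 14.8, 21.6],
    "ws02": [2.9, 0.8, 3.1, 1.0, 2.0],
})

threshold = 0.95  # illustrative cutoff, not a recommended value

# Keep only the upper triangle so each pair is considered once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_selected = X.drop(columns=to_drop)
print(to_drop)
```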

### 5) Aggregation
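
A minimal sketch of aggregation with `groupby`, assuming (hypothetically) that the unit of prediction is one station at one observation time. The `toy` frame and the output column names `n_measures` and `any_target` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with several rows per (Station, Ob) pair.
toy = pd.DataFrame({
    "Station": ["AURO", "AURO", "AURO", "AURO"],
    "Ob": ["1/2/2021 0:30", "1/2/2021 0:30", "1/2/2021 4:30", "1/2/2021 4:30"],
    "measure": ["temp_wxt", "sm", "temp_wxt", "sm"],
    "target": [False, True, False, False],
})

# Collapse to one row per station and timestamp.
per_obs = toy.groupby(["Station", "Ob"], as_index=False).agg(
    n_measures=("measure", "count"),  # how many measurements were recorded
    any_target=("target", "any"),     # True if any row in the group is True
)
print(per_obs)
```

Which aggregation function suits each column (any, mean, max, and so on) depends on what the column means for your prediction task.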

### Revisiting EDA