# Homework 3
In this homework assignment, you will begin to explore the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM). 


Below you will find a number of steps that you will be required to complete before you can start the assignment.

---

## Step 1: Downloading the Data

Download individual partitions of the dataset through the following links:
- [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB)
- [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD)
- [Partition 3](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/PTPGQT)
- [Partition 4](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/FIFLFU)
- [Partition 5](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/QC2C3X)

This assignment uses only [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB), but we will be using more than one partition by the end of the semester. In later steps, you will need to access the uncompressed files from these partitions, so remember where you put them.

A paper describing the construction of the dataset can be found [here](https://doi.org/10.1038/s41597-020-0548-x).

---

### Dataset Attributes:

Each file in the dataset contains the following attributes, each of which is a single variate of the multivariate time series sample.

|                      |                      |                      |
|----------------------|----------------------|----------------------|
| 1. Timestamp         | 2. TOTUSJH           | 3. TOTBSQ            |
| 4. TOTPOT            | 5. TOTUSJZ           | 6. ABSNJZH           |
| 7. SAVNCPP           | 8. USFLUX            | 9. TOTFZ             |
| 10. MEANPOT          | 11. EPSZ             | 12. MEANSHR          |
| 13. SHRGT45          | 14. MEANGAM          | 15. MEANGBT          |
| 16. MEANGBZ          | 17. MEANGBH          | 18. MEANJZH          |
| 19. TOTFY            | 20. MEANJZD          | 21. MEANALP          |
| 22. TOTFX            | 23. EPSY             | 24. EPSX             |
| 25. R_VALUE          | 26. CRVAL1           | 27. CRLN_OBS         |
| 28. CRLT_OBS         | 29. CRVAL2           | 30. HC_ANGLE         |
| 31. SPEI             | 32. LAT_MIN          | 33. LON_MIN          |
| 34. LAT_MAX          | 35. LON_MAX          | 36. QUALITY          |
| 37. BFLARE           | 38. BFLARE_LABEL     | 39. CFLARE           |
| 40. CFLARE_LABEL     | 41. MFLARE           | 42. MFLARE_LABEL     |
| 43. XFLARE           | 44. XFLARE_LABEL     | 45. BFLARE_LOC       |
| 46. BFLARE_LABEL_LOC | 47. CFLARE_LOC       | 48. CFLARE_LABEL_LOC |
| 49. MFLARE_LOC       | 50. MFLARE_LABEL_LOC | 51. XFLARE_LOC       |
| 52. XFLARE_LABEL_LOC | 53. XR_MAX           | 54. XR_QUAL          |
| 55. IS_TMFI          |                      |                      |

---


## Step 2: Setting up your Environment

You will be using the [MVTS Data Toolkit v0.2.6](https://bitbucket.org/gsudmlab/mvtsdata_toolkit) library developed by the [Data Mining Lab](https://dmlab.cs.gsu.edu/) at GSU to generate features from this dataset. 

This tool requires different versions of some libraries than those installed when we put Anaconda on your computer at the beginning of the semester. To get around this, we will create an environment specifically for use with this library.

--- 

An environment file was included in the archive given to you for your assignment. Use it to create a conda environment with the following command

    conda env create -f flare_env.yml
    
Then switch to the newly created environment using the following command

    conda activate flare_env
    
Then install the [MVTS Data Toolkit v0.2.6](https://bitbucket.org/gsudmlab/mvtsdata_toolkit) library as follows

    pip install mvtsdatatoolkit
    
Assuming you have navigated to where this assignment notebook file is, you will need to restart Jupyter from within your newly created environment

    jupyter notebook

Anaconda provides this mechanism to allow you to manage multiple environments with different versions of libraries installed.  Each time you wish to start using this environment you will need to activate it again. You can read more about managing environments in Anaconda ([here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)).
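
A quick way to confirm that the notebook kernel is actually running inside the new environment is to check which interpreter is in use and whether the toolkit can be found. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if a top-level module can be located by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None

#The interpreter path should point inside the flare_env environment
print(sys.executable)
#Should print True once 'pip install mvtsdatatoolkit' has been run in this environment
print(is_installed("mvtsdatatoolkit"))
```

If the second line prints `False`, the notebook was most likely started from a different environment than the one the toolkit was installed into.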

---

Documentation of the various methods available through the library can be found ([here](https://dmlab.cs.gsu.edu/docs/mvtsdata_toolkit/)).

A tutorial on how to use the library can be found ([here](https://mybinder.org/v2/git/https%3A%2F%2Fbitbucket.org%2Fgsudmlab%2Fmvtsdata_toolkit%2Fsrc%2Fmaster/master?filepath=.%2Fdemo.ipynb)).

---

## Step 3: Renaming files to fit the library requirements


The [MVTS Data Toolkit v0.2.6](https://bitbucket.org/gsudmlab/mvtsdata_toolkit) requires the multivariate time series files to be named in a specific way so that it can read them and assign a label to each one when processing. Below is a method I have written for you to do this renaming automatically. 


In [1]:
from os import listdir
from shutil import copyfile
from os.path import isfile, isdir, join, exists 
from os import makedirs

In [2]:
def renameAndCopy(file_list, source_dir, dest_dir):
    if not exists(dest_dir):
        makedirs(dest_dir)
    for f in file_list:
        fileOut = "lab[{0}]_id[{1}]_st[{2}]_et[{3}].csv"
        idx = f.find('@')
        idx4 = f.find('_ar')
        idx1 = f.find('_s')
        idx2 = f.find('_e')
        idx3 = f.find('.csv')
        
        startTime = f[idx1+2:idx2]
        endTime = f[idx2+2:idx3]
        idVal = f[idx4+3:idx1]
        if(idx == -1):
            fileOut = fileOut.format('NF', idVal, startTime, endTime)
        else:
            label = f[0:1]
            fileOut = fileOut.format(label, idVal, startTime, endTime)
        copyfile(join(source_dir, f), join(dest_dir, fileOut))
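
To see what the renaming does without touching any files, the same parsing can be tried on a single name. This is only a sketch: the two sample file names are hypothetical and assume the pattern `<label...>_ar<id>_s<start>_e<end>.csv` with a numeric id, and the helper `toolkit_name` uses a regular expression instead of the `find` calls above, so it raises an error on names that do not fit.

```python
import re

#Assumed pattern: '..._ar<digits>_s<start time>_e<end time>.csv'
NAME_PATTERN = re.compile(r"_ar(\d+)_s(.+?)_e(.+?)\.csv$")

def toolkit_name(filename):
    """Build the 'lab[..]_id[..]_st[..]_et[..].csv' name for one source file name."""
    match = NAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    id_val, start_time, end_time = match.groups()
    #Flaring files carry an '@' and start with the flare class letter
    label = filename[0] if "@" in filename else "NF"
    return f"lab[{label}]_id[{id_val}]_st[{start_time}]_et[{end_time}].csv"

#Hypothetical sample names, for illustration only
flaring = toolkit_name("M1.0@265:Primary_ar115_s2010-08-06T06:36:00_e2010-08-06T18:24:00.csv")
quiet = toolkit_name("NF_ar115_s2010-08-06T06:36:00_e2010-08-06T18:24:00.csv")
print(flaring)
print(quiet)
```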


You will need to change the directories to fit your computer, but this is enough to get you started.

### I have limited the partitions to just partition 1 because my system kept crashing while processing the full dataset.

In [4]:
partitions = ["partition1"] #, "partition2", "partition3", "partition4", "partition5"]
baseSrcDir = "{0}/" #This line needs to be changed to where you have your partitions stored
baseDestDir = "dest_dir/{0}/" #This line needs to be changed to where you want to store the renamed files

#This loop processes all of the flaring and nonflaring data from each partition
#It copies the files to your processed files directory using the naming convention required by the toolkit library
#This only needs to be executed once
for p in partitions:
    
    path_to_partition = baseSrcDir.format(p)
    path_to_renamed_files = baseDestDir.format(p)
    
    flareDir = join(path_to_partition, 'FL')
    nonflareDir = join(path_to_partition, 'NF')
    
    flareFilesList = [f for f in listdir(flareDir) if isfile(join(flareDir, f))]
    renameAndCopy(flareFilesList, flareDir, path_to_renamed_files)
    nonflareFilesList = [f for f in listdir(nonflareDir) if isfile(join(nonflareDir, f))]
    renameAndCopy(nonflareFilesList, nonflareDir, path_to_renamed_files)

Use the provided config file (flare_hw_3_config.yml) for the [MVTS Data Toolkit v0.2.6](https://bitbucket.org/gsudmlab/mvtsdata_toolkit) library to read MVTS Parameters 2 through 25 of the dataset above. 

You will need to configure the FeatureExtractor object of the library and use it to calculate the features from the time series of these parameters in [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB) of the dataset. 
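
Conceptually, feature extraction collapses each time series into a handful of summary numbers, one per parameter and statistic, which is where column names such as `TOTUSJH_min` come from. The sketch below illustrates that idea with plain pandas on made-up values; it is not the toolkit's implementation, and the helper `summarize_series` is hypothetical.

```python
import pandas as pd

def summarize_series(mvts, params, stats=("min", "max", "median", "mean")):
    """Collapse each selected time series column into '<param>_<stat>' features."""
    features = {}
    for p in params:
        for s in stats:
            #Look up the pandas Series method with the same name as the statistic
            features[f"{p}_{s}"] = getattr(mvts[p], s)()
    return features

#Made-up values standing in for one multivariate time series file
toy = pd.DataFrame({"TOTUSJH": [1.0, 3.0, 2.0], "TOTBSQ": [10.0, 30.0, 20.0]})
row = summarize_series(toy, ["TOTUSJH", "TOTBSQ"])
print(row)
```

Each input file becomes one such row, so the extracted feature table has one row per file and one column per parameter and statistic pair.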

**Note:** The provided config file has a directory in it that lists where to look for the data. You need to edit this to match your directory structure.  

---

Save the extracted features to a CSV file in some location so you can use them at a later time.  The rest of the assignment requires these features as input.  

A useful method to save with is the [pandas.to_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) method.

**Note:** Because feature extraction is slow, use the "first_k=50" parameter for the method calls, then write the code that stores the result as though you had processed the entire partition. I will provide a link below to an already completed dataset for you to use on the remainder of the assignment. 

In [7]:
from mvtsdatatoolkit.features.feature_extractor import FeatureExtractor


In [36]:
#Place your answer here, maybe even add a few more cells to break up your work into parts.
path_to_config = 'flare_hw_3_config.yml'

In [38]:
from mvtsdatatoolkit.data_analysis.mvts_data_analysis import MVTSDataAnalysis

mvda = MVTSDataAnalysis(path_to_config)
mvda.print_stat_of_directory()

----------------------------------------
Directory:			./dest_dir/partition1/
Total no. of files:	73492
Total size:			-1158561959B
Total average:		41K
----------------------------------------


In [39]:
fe = FeatureExtractor(path_to_config)
fe.do_extraction(first_k=50)
fe.df_all_features

Unnamed: 0,id,lab,st,et,TOTUSJH_min,TOTUSJH_max,TOTUSJH_median,TOTUSJH_mean,TOTUSJH_stddev,TOTUSJH_var,...,R_VALUE_gderivative_kurtosis,R_VALUE_linear_weighted_average,R_VALUE_quadratic_weighted_average,R_VALUE_average_absolute_change,R_VALUE_average_absolute_derivative_change,R_VALUE_last_value,R_VALUE_slope_of_longest_mono_increase,R_VALUE_slope_of_longest_mono_decrease,R_VALUE_avg_mono_increase_slope,R_VALUE_avg_mono_decrease_slope
0,104,B,2010-07-29T17_36_00,2010-07-30T05_24_00,829.154453,1135.912337,1093.211361,1089.53285,38.701149,1497.778905,...,0.437955,4.759565,4.74042,0.016301,0.025893,4.72528,0.02156,-0.000522,0.01053,-0.009819
1,104,B,2010-07-29T18_36_00,2010-07-30T06_24_00,829.154453,1135.912337,1089.432718,1086.850782,39.068763,1526.368258,...,-0.959347,4.745548,4.726773,0.014959,0.023981,4.667606,0.02156,-0.000522,0.009391,-0.009614
2,104,B,2010-07-29T19_36_00,2010-07-30T07_24_00,829.154453,1135.912337,1086.399339,1084.00561,39.882221,1590.591583,...,-1.026691,4.723306,4.702377,0.015567,0.02489,4.598103,0.02156,-0.000522,0.00828,-0.010223
3,104,B,2010-07-29T20_36_00,2010-07-30T08_24_00,829.154453,1135.912337,1083.873721,1081.337333,41.036746,1684.01455,...,-0.287404,4.697463,4.674044,0.016372,0.024495,4.635672,0.02156,-0.000522,0.009067,-0.010394
4,104,B,2010-07-29T21_36_00,2010-07-30T09_24_00,1021.408374,1137.77943,1088.104949,1087.350369,26.001976,676.10275,...,-0.765247,4.68292,4.662591,0.016314,0.024832,4.640215,0.02156,-0.000522,0.008704,-0.010349
5,104,B,2010-07-29T22_36_00,2010-07-30T10_24_00,1021.408374,1231.835028,1090.280518,1095.677514,38.124828,1453.502541,...,3.565937,4.665986,4.647344,0.017646,0.025594,4.66836,0.02848,-0.000522,0.010266,-0.01062
6,104,B,2010-07-29T23_36_00,2010-07-30T11_24_00,1021.408374,1231.835028,1095.762043,1103.789377,47.530677,2259.165235,...,0.09122,4.658485,4.64529,0.017417,0.026641,4.654956,0.02848,-0.000522,0.010344,-0.010304
7,104,B,2010-07-30T00_36_00,2010-07-30T12_24_00,1021.408374,1231.835028,1096.444144,1110.820275,53.47837,2859.936082,...,0.394719,4.650651,4.641535,0.017143,0.025832,4.653097,0.02848,-0.000522,0.009966,-0.01002
8,104,B,2010-07-30T01_36_00,2010-07-30T13_24_00,1021.408374,1231.835028,1104.87881,1117.43022,55.071901,3032.914274,...,0.446105,4.643807,4.637696,0.017883,0.026874,4.616681,0.02848,-0.000375,0.010087,-0.010192
9,104,B,2010-07-30T02_36_00,2010-07-30T14_24_00,1021.408374,1231.835028,1124.333215,1124.470871,54.781404,3001.002185,...,0.726491,4.631483,4.624413,0.017572,0.026925,4.568078,0.02848,-0.002725,0.009034,-0.010746


In [48]:
path_to_store = r'C:\Users\G.S Ramchandra\Desktop\Varchala\GSU\Classes\FDS\Homework3\dest_dir'
#join() adds the path separator, so the file name no longer starts with a stray space
fe.df_all_features.to_csv(join(path_to_store, 'extracted_features.csv'))

### Q2 (10 points)

Now that you have saved the extracted features to a csv file, you will load that data into a Pandas DataFrame using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method.  

**Note:** To help alleviate the need to compute all the features yourself, I have provided an already computed file ([here](http://dmlab.cs.gsu.edu/solar/data/partition1ExtractedFeatures.csv)).

Using this dataframe object, you should perform the simple min/max 0-1 normalization on the data. Once this is done, you should save this normalized data as a new csv file for later use.
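
For reference, min/max 0-1 normalization rescales each numeric column as x' = (x - min) / (max - min). The sketch below writes that formula out in plain pandas on made-up values; the helper name `min_max_normalize` and its `excluded` argument are illustrative and are not part of the toolkit.

```python
import pandas as pd

def min_max_normalize(df, excluded=()):
    """Rescale every numeric column to [0, 1], leaving excluded and non-numeric columns as-is."""
    out = df.copy()
    numeric = [c for c in out.select_dtypes(include="number").columns
               if c not in excluded]
    col_min = out[numeric].min()
    col_range = out[numeric].max() - col_min
    #A constant column has a range of 0 and therefore becomes NaN
    out[numeric] = (out[numeric] - col_min) / col_range
    return out

#Made-up values for illustration
toy = pd.DataFrame({"id": [1, 2, 3],
                    "lab": ["B", "NF", "C"],
                    "TOTUSJH_mean": [2.0, 4.0, 6.0]})
scaled = min_max_normalize(toy, excluded=["id"])
print(scaled)
```

Note that constant columns turn into NaN under this formula, which is one way null values can appear in normalized data.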

In [49]:
import pandas as pd
from mvtsdatatoolkit.normalizing import normalizer

In [50]:
#Place your answer here, maybe even add a few more cells to break up your work into parts.

data = pd.read_csv('partition1ExtractedFeatures.csv')



In [51]:
df_norm = normalizer.zero_one_normalize(data, excluded_colnames=['id'])
df_norm

Unnamed: 0.1,Unnamed: 0,id,lab,st,et,TOTUSJH_min,TOTUSJH_max,TOTUSJH_median,TOTUSJH_mean,TOTUSJH_stddev,...,R_VALUE_gderivative_kurtosis,R_VALUE_linear_weighted_average,R_VALUE_quadratic_weighted_average,R_VALUE_average_absolute_change,R_VALUE_average_absolute_derivative_change,R_VALUE_last_value,R_VALUE_slope_of_longest_mono_increase,R_VALUE_slope_of_longest_mono_decrease,R_VALUE_avg_mono_increase_slope,R_VALUE_avg_mono_decrease_slope
0,0.000000,514,C,2011-04-28T23:12:00,2011-04-29T11:00:00,0.291283,0.298636,0.292740,0.294696,0.070926,...,0.049894,0.802111,0.805881,0.020016,0.016058,0.811280,0.018846,0.999440,0.009912,0.991168
1,0.000014,107,NF,2010-08-04T06:36:00,2010-08-04T18:24:00,0.006735,0.017104,0.008302,0.009188,0.024988,...,0.567005,0.230784,0.304068,0.068847,0.058740,0.499631,0.406374,0.967035,0.148119,0.943999
2,0.000027,798,C,2011-08-21T05:12:00,2011-08-21T17:00:00,0.142235,0.152566,0.145983,0.146485,0.056311,...,0.129826,0.731458,0.740778,0.042363,0.033872,0.745044,0.063795,0.997465,0.019984,0.980138
3,0.000041,1372,NF,2012-02-02T11:48:00,2012-02-02T23:36:00,0.017201,0.021852,0.018798,0.019088,0.013292,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
4,0.000054,475,NF,2011-04-09T01:24:00,2011-04-09T13:12:00,0.050536,0.058681,0.053915,0.054674,0.023091,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
5,0.000068,1038,NF,2011-11-09T08:12:00,2011-11-09T20:00:00,0.036069,0.045850,0.037025,0.037846,0.023401,...,0.138832,0.227643,0.271640,0.255952,0.257094,0.417876,0.606409,0.997508,0.237922,0.798281
6,0.000082,129,NF,2010-08-10T02:00:00,2010-08-10T13:48:00,0.000689,0.001864,0.001132,0.001177,0.002481,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
7,0.000095,1430,NF,2012-02-28T11:24:00,2012-02-28T23:12:00,0.005329,0.008343,0.006437,0.006499,0.006449,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
8,0.000109,937,NF,2011-10-09T13:36:00,2011-10-10T01:24:00,0.005704,0.008415,0.007503,0.007513,0.005715,...,0.137251,0.050779,0.032008,0.164876,0.170611,0.000000,0.593836,0.989727,0.458353,0.754494
9,0.000122,1302,NF,2012-01-12T21:00:00,2012-01-13T08:48:00,0.001423,0.004636,0.002264,0.002532,0.007447,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000


In [56]:
df_norm.to_csv('./dest_dir/norm_extracted_features.csv')

### Q3 (20 points)

Using the normalized data from question 2, perform analysis of the various features.  Find the features that have NULL/NAN values and drop the feature if more than 1% of the entries have null/nan values. For the rest, drop the specific entry that has the null/nan values.  

In [95]:
import pandas as pd
from mvtsdatatoolkit.data_analysis.extracted_features_analysis import ExtractedFeaturesAnalysis

In [96]:
norm_data = pd.read_csv('./dest_dir/norm_extracted_features.csv')

### I keep the data from the 3rd column onward because the first two columns are unnamed index columns created by saving the CSV files together with their indexes

In [97]:
norm_data = norm_data.iloc[:,2:]
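
The unnamed columns appear because `to_csv` writes the row index by default and `read_csv` then loads it back as an ordinary column. A small round trip on made-up values shows the effect and how `index=False` avoids it:

```python
import io
import pandas as pd

#Made-up values for illustration
frame = pd.DataFrame({"id": [514, 107], "lab": ["C", "NF"]})

#Default behavior: the index is written as an extra, unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

#index=False leaves the index out, so the columns survive the round trip unchanged
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'id', 'lab']
print(list(reloaded_clean.columns))       # ['id', 'lab']
```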

In [98]:
norm_data

Unnamed: 0,id,lab,st,et,TOTUSJH_min,TOTUSJH_max,TOTUSJH_median,TOTUSJH_mean,TOTUSJH_stddev,TOTUSJH_var,...,R_VALUE_gderivative_kurtosis,R_VALUE_linear_weighted_average,R_VALUE_quadratic_weighted_average,R_VALUE_average_absolute_change,R_VALUE_average_absolute_derivative_change,R_VALUE_last_value,R_VALUE_slope_of_longest_mono_increase,R_VALUE_slope_of_longest_mono_decrease,R_VALUE_avg_mono_increase_slope,R_VALUE_avg_mono_decrease_slope
0,514,C,2011-04-28T23:12:00,2011-04-29T11:00:00,0.291283,0.298636,0.292740,0.294696,0.070926,0.005068,...,0.049894,0.802111,0.805881,0.020016,0.016058,0.811280,0.018846,0.999440,0.009912,0.991168
1,107,NF,2010-08-04T06:36:00,2010-08-04T18:24:00,0.006735,0.017104,0.008302,0.009188,0.024988,0.000638,...,0.567005,0.230784,0.304068,0.068847,0.058740,0.499631,0.406374,0.967035,0.148119,0.943999
2,798,C,2011-08-21T05:12:00,2011-08-21T17:00:00,0.142235,0.152566,0.145983,0.146485,0.056311,0.003201,...,0.129826,0.731458,0.740778,0.042363,0.033872,0.745044,0.063795,0.997465,0.019984,0.980138
3,1372,NF,2012-02-02T11:48:00,2012-02-02T23:36:00,0.017201,0.021852,0.018798,0.019088,0.013292,0.000184,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
4,475,NF,2011-04-09T01:24:00,2011-04-09T13:12:00,0.050536,0.058681,0.053915,0.054674,0.023091,0.000546,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
5,1038,NF,2011-11-09T08:12:00,2011-11-09T20:00:00,0.036069,0.045850,0.037025,0.037846,0.023401,0.000561,...,0.138832,0.227643,0.271640,0.255952,0.257094,0.417876,0.606409,0.997508,0.237922,0.798281
6,129,NF,2010-08-10T02:00:00,2010-08-10T13:48:00,0.000689,0.001864,0.001132,0.001177,0.002481,0.000008,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
7,1430,NF,2012-02-28T11:24:00,2012-02-28T23:12:00,0.005329,0.008343,0.006437,0.006499,0.006449,0.000045,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
8,937,NF,2011-10-09T13:36:00,2011-10-10T01:24:00,0.005704,0.008415,0.007503,0.007513,0.005715,0.000036,...,0.137251,0.050779,0.032008,0.164876,0.170611,0.000000,0.593836,0.989727,0.458353,0.754494
9,1302,NF,2012-01-12T21:00:00,2012-01-13T08:48:00,0.001423,0.004636,0.002264,0.002532,0.007447,0.000060,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000


In [99]:
#Place your answer here, maybe even add a few more cells to break up your work into parts.

analysis = ExtractedFeaturesAnalysis(norm_data, exclude=['id'])
analysis.compute_summary()
analysis.summary

Unnamed: 0,Feature-Name,Val-Count,Null-Count,mean,std,min,25th,50th,75th,max
0,TOTUSJH_min,73492.0,0,0.069838,0.122231,0.0,0.005239,0.019367,0.073697,1.0
1,TOTUSJH_max,73492.0,0,0.080933,0.131812,0.0,0.008611,0.026658,0.089033,1.0
2,TOTUSJH_median,73492.0,0,0.074532,0.125712,0.0,0.006908,0.022725,0.080179,1.0
3,TOTUSJH_mean,73492.0,0,0.074900,0.126203,0.0,0.006930,0.022870,0.080672,1.0
4,TOTUSJH_stddev,73492.0,0,0.040156,0.056785,0.0,0.007678,0.018159,0.048867,1.0
5,TOTUSJH_var,73492.0,0,0.004857,0.018114,0.0,0.000063,0.000340,0.002414,1.0
6,TOTUSJH_skewness,73492.0,0,0.470759,0.057435,0.0,0.437357,0.467601,0.499473,1.0
7,TOTUSJH_kurtosis,73492.0,0,0.035810,0.051932,0.0,0.018054,0.026229,0.038117,1.0
8,TOTUSJH_no_zero_crossings,73492.0,0,0.000351,0.012709,0.0,0.000000,0.000000,0.000000,1.0
9,TOTUSJH_mean_local_maxima_value,73492.0,0,0.074962,0.125854,0.0,0.007071,0.023053,0.080778,1.0


### The following piece of code removes the features that have more than 1% NaN/null values and then drops the remaining rows that contain any NaN/null values

In [100]:
for i in norm_data:
    nul_val = norm_data[i].isnull().sum()
    if(nul_val*100/norm_data[i].shape[0]>1):
        norm_data = norm_data.drop(i, axis=1)       
norm_data = norm_data.dropna(how = 'any')
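
The same rule can also be written without an explicit loop. The sketch below applies it to made-up data so the effect is easy to check: a column with 2.5% missing values is dropped as a feature, while a column with 0.5% missing values only costs the affected row. The helper name `drop_sparse_features_then_rows` is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_features_then_rows(df, max_null_fraction=0.01):
    """Drop columns whose null fraction exceeds the threshold, then drop rows with any null."""
    null_fraction = df.isnull().mean()
    kept = df.loc[:, null_fraction <= max_null_fraction]
    return kept.dropna(how="any")

#Made-up data: 200 rows, three numeric columns
toy = pd.DataFrame({"a": np.arange(200, dtype=float),
                    "b": np.arange(200, dtype=float),
                    "c": np.arange(200, dtype=float)})
toy.loc[:4, "b"] = np.nan   #rows 0..4 inclusive: 5 of 200 = 2.5% null, column is dropped
toy.loc[10, "c"] = np.nan   #1 of 200 = 0.5% null, only this row is dropped

cleaned = drop_sparse_features_then_rows(toy)
print(cleaned.shape)
```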

In [101]:
norm_data

Unnamed: 0,id,lab,st,et,TOTUSJH_min,TOTUSJH_max,TOTUSJH_median,TOTUSJH_mean,TOTUSJH_stddev,TOTUSJH_var,...,R_VALUE_gderivative_kurtosis,R_VALUE_linear_weighted_average,R_VALUE_quadratic_weighted_average,R_VALUE_average_absolute_change,R_VALUE_average_absolute_derivative_change,R_VALUE_last_value,R_VALUE_slope_of_longest_mono_increase,R_VALUE_slope_of_longest_mono_decrease,R_VALUE_avg_mono_increase_slope,R_VALUE_avg_mono_decrease_slope
0,514,C,2011-04-28T23:12:00,2011-04-29T11:00:00,0.291283,0.298636,0.292740,0.294696,0.070926,0.005068,...,0.049894,0.802111,0.805881,0.020016,0.016058,0.811280,0.018846,0.999440,0.009912,0.991168
1,107,NF,2010-08-04T06:36:00,2010-08-04T18:24:00,0.006735,0.017104,0.008302,0.009188,0.024988,0.000638,...,0.567005,0.230784,0.304068,0.068847,0.058740,0.499631,0.406374,0.967035,0.148119,0.943999
2,798,C,2011-08-21T05:12:00,2011-08-21T17:00:00,0.142235,0.152566,0.145983,0.146485,0.056311,0.003201,...,0.129826,0.731458,0.740778,0.042363,0.033872,0.745044,0.063795,0.997465,0.019984,0.980138
3,1372,NF,2012-02-02T11:48:00,2012-02-02T23:36:00,0.017201,0.021852,0.018798,0.019088,0.013292,0.000184,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
4,475,NF,2011-04-09T01:24:00,2011-04-09T13:12:00,0.050536,0.058681,0.053915,0.054674,0.023091,0.000546,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
5,1038,NF,2011-11-09T08:12:00,2011-11-09T20:00:00,0.036069,0.045850,0.037025,0.037846,0.023401,0.000561,...,0.138832,0.227643,0.271640,0.255952,0.257094,0.417876,0.606409,0.997508,0.237922,0.798281
6,129,NF,2010-08-10T02:00:00,2010-08-10T13:48:00,0.000689,0.001864,0.001132,0.001177,0.002481,0.000008,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
7,1430,NF,2012-02-28T11:24:00,2012-02-28T23:12:00,0.005329,0.008343,0.006437,0.006499,0.006449,0.000045,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000
8,937,NF,2011-10-09T13:36:00,2011-10-10T01:24:00,0.005704,0.008415,0.007503,0.007513,0.005715,0.000036,...,0.137251,0.050779,0.032008,0.164876,0.170611,0.000000,0.593836,0.989727,0.458353,0.754494
9,1302,NF,2012-01-12T21:00:00,2012-01-13T08:48:00,0.001423,0.004636,0.002264,0.002532,0.007447,0.000060,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,1.000000


### Q4 (20 points)

Using the normalized and cleaned data from question 3, you now need to perform feature selection on the dataset and take the 20 most useful features for classification. For now, we will utilize all the different labels in our evaluation of features (i.e. NF, B, C, M, X).  To perform the ranking you will utilize the ANOVA F-Value to select the top 20 features and save them to a new file.

Useful tools for this operation include [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) and the [scikit-learn f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif) method.  



In [102]:
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [106]:
#Place your answer here, maybe even add a few more cells to break up your work into parts.
X = norm_data.iloc[:,4:]
Y = norm_data.iloc[:,1]
f_value,p_value = f_classif(X,Y)

In [110]:
sel_f = SelectKBest(f_classif, k=20)
X_train = sel_f.fit_transform(X, Y)
print(sel_f.get_support())

  f = msb / msw


[ True  True  True  True False False False False False  True  True False
 False False False False False False False False False False False False
 False  True  True False False  True False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False  True  True False False  True
 False False False False False False False False  True False False False
 False False False False False False False False False False False False
 False False False False False False False False Fa

In [113]:
X_train.shape

(73420, 20)
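
`fit_transform` returns a plain array, so the feature names are lost unless they are looked up through the boolean mask from `get_support()`. The sketch below shows one way to keep them, using synthetic data in which only one feature depends on the class label; all column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
labels = np.repeat(["NF", "B", "C"], 50)
class_shift = np.repeat([0.0, 1.0, 2.0], 50)

#One feature whose mean differs by class, and two that are pure noise
toy_X = pd.DataFrame({
    "informative": class_shift + rng.normal(scale=0.1, size=150),
    "noise_a": rng.normal(size=150),
    "noise_b": rng.normal(size=150),
})

selector = SelectKBest(f_classif, k=1).fit(toy_X, labels)
#get_support() is a boolean mask aligned with the input columns
selected_names = toy_X.columns[selector.get_support()]
selected = pd.DataFrame(selector.transform(toy_X), columns=selected_names)
print(list(selected.columns))
```

Applied to the homework data, the same indexing with `X.columns[sel_f.get_support()]` would give the names of the 20 selected features.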

In [117]:
X_train = pd.DataFrame(X_train)
X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.291283,0.298636,0.292740,0.294696,0.293600,0.292806,0.293240,0.291538,0.275407,0.340770,0.337206,0.269271,0.097844,0.156977,0.110503,0.144180,0.108424,0.093636,0.105202,0.894831
1,0.006735,0.017104,0.008302,0.009188,0.009556,0.009236,0.010144,0.011024,0.016565,0.013925,0.015280,0.021521,0.006619,0.016154,0.022979,0.035237,0.021263,0.019053,0.023813,0.979909
2,0.142235,0.152566,0.145983,0.146485,0.147762,0.145611,0.147598,0.148702,0.151633,0.171533,0.172167,0.154230,0.108524,0.070042,0.073679,0.075663,0.068597,0.067716,0.079062,0.939211
3,0.017201,0.021852,0.018798,0.019088,0.019533,0.018540,0.019182,0.019200,0.016871,0.024110,0.023816,0.017819,0.011113,0.018950,0.023350,0.027455,0.022557,0.020865,0.021799,0.975956
4,0.050536,0.058681,0.053915,0.054674,0.054667,0.053769,0.054489,0.054127,0.052701,0.068163,0.067654,0.055965,0.015647,0.037632,0.038772,0.044561,0.039833,0.035986,0.040462,0.956080
5,0.036069,0.045850,0.037025,0.037846,0.038125,0.037122,0.038737,0.039376,0.041145,0.048422,0.049475,0.045010,0.012729,0.032087,0.041743,0.049345,0.039515,0.036312,0.038549,0.959029
6,0.000689,0.001864,0.001132,0.001177,0.001224,0.001058,0.001088,0.001023,0.000845,0.001140,0.001045,0.000716,0.002873,0.005532,0.008397,0.007576,0.009018,0.009102,0.008128,0.992283
7,0.005329,0.008343,0.006437,0.006499,0.006596,0.006106,0.006345,0.006260,0.005696,0.007045,0.006821,0.005405,0.010168,0.020669,0.023342,0.024553,0.022805,0.022693,0.023242,0.973151
8,0.005704,0.008415,0.007503,0.007513,0.007711,0.007262,0.007519,0.007463,0.007080,0.009507,0.009302,0.007300,0.007492,0.011611,0.015809,0.019210,0.014715,0.013636,0.015114,0.985957
9,0.001423,0.004636,0.002264,0.002532,0.002737,0.002437,0.002114,0.001973,0.001785,0.002662,0.002549,0.001873,0.006384,0.010114,0.011250,0.008005,0.011683,0.011947,0.009472,0.987459


In [118]:
X_train.to_csv('./dest_dir/X_train.csv')