<a href="https://colab.research.google.com/github/alicezil/38615-Lab-1/blob/main/Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 38615 Lab 1: Exploratory Data Anlysis

You will have to perform EDA analysis:

*   Load data, prepare for analysis, process if necessary
*   Analyze types of data
*   Find and process missing and erroneous features
*   Find outliers (if any)
*   Find highly correlated variables (if any).
*   Find if the target variable is correlated with any features.
*   Use PCA to plot data in 2D and color code by the target property. Do you    see any patterns?
*   Prepare a short write-up describing your processing technics and choices above.

Bonus Qs:
*   Use any non-linear dimensionality reduction method. Plot data in 2D and color code by the target property. Compare observed picture with PCA.
*   Surprise me! Uncover hidden patterns and find non-trivial relationships in the data

# 1. Loading and Preparing Data for Analysis

1.1 Importing the necessary libraries:

In [26]:
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import manifold
from sklearn.decomposition import PCA

%matplotlib inline 
sns.set(color_codes=True)

1.2 Loading the data into the data frame:

In [27]:
df = pd.read_csv("/content/lab1_dataset.csv")
df.shape

(2000, 552)

1.3 Basic data preperation:

In [28]:
df = df.drop_duplicates()    # removing duplicate rows if there are any
df = df.dropna()             # dropping the missing values

df.shape

(1900, 552)

1.4 Removing outliers:

In [29]:
#getting a summary table that includes mean and standard deviation
df_summary = df.describe(include = 'all')

#using the standard deviation to remove outliers (where outlier = 5 standard devs away from the mean)
for col in df:
  std = df_summary[col]['std']
  mean = df_summary[col]['mean']
  if df[col].dtypes == 'int64' or df[col].dtypes == 'float62':
    for i in df.index:
      if ((df[col][i] < (mean - 5*std)) or (df[col][i] > (mean + 5*std))):
        df = df.drop([i])

df.shape

(1741, 552)

#2. Finding Correlation in Data

2.1 Constructing a correlation matrix

In [31]:
#create correlation matrix
corr_matrix = df.corr().abs()

2.2 Removing highly correlated data

In [32]:
#isolate upper triangle (for lack of repetition)
upper_triangle = corr_matrix.where((np.triu(np.ones(corr_matrix.shape), k=1) + 
                           np.tril(np.ones(corr_matrix.shape), k=-1)).astype(bool))

#make a list of columns with correlation larger than .97
drop_list = []
for col in upper_triangle.columns:
  if any(upper_triangle[col] > 0.97):
    drop_list.append(col)

#drop all the columns from the list
df.drop(drop_list, axis=1, inplace=True)

df.shape

(1741, 242)

#3. Establish Target Variable and Prepare It for Analysis

3.1 Establishing a target variable

In [33]:
#show top of the data
df.head()

Unnamed: 0,experimental_proprty,MS_enc,nHetero,nX,C2SP3,nFAHRing,AATS3d,nHBDon,nAcid,PEOE_VSA8,...,ATSC6d,ATS5dv,NsssCH,AATSC2v,PEOE_VSA12,IC1,EState_VSA5,AATSC1dv,nFARing,n6Ring
0,3.54,PPENPINEAPLE42,5,1,0,0,3.127273,0,0,43.936717,...,-4.444444,222.888889,0,4.021507,0.0,3.133948,11.204087,1.102539,0,3
1,-1.18,PPENPINEAPLE42,11,0,3,1,3.477273,2,1,12.108208,...,-9.756392,736.333333,1,4.393094,5.90718,3.847419,30.657545,0.854039,1,3
2,3.69,PPENPINEAPLE42,5,1,2,1,3.4,0,0,22.989293,...,-3.103725,327.444444,1,5.398449,0.0,3.524624,17.550396,0.793417,1,2
3,3.37,HTXPTDWTTWOBJR,9,1,3,1,3.309735,4,0,17.494432,...,-5.502836,582.777778,2,-0.991896,5.90718,4.110093,16.236696,0.421656,1,2
4,3.1,PPENPINEAPLE42,7,0,2,0,2.816,2,0,18.883484,...,0.297521,510.0,1,7.240313,5.90718,3.555674,4.681803,1.60281,0,1


The 'experimental_proprty' column sounds like it may be our target variable. Let's get more info on it:

In [34]:
#summary of possible target variable column
df['experimental_proprty'].describe()

count    1741.000000
mean        2.152068
std         1.210546
min        -1.500000
25%         1.350000
50%         2.330000
75%         3.090000
max         4.500000
Name: experimental_proprty, dtype: float64

It seems like the range is quite small since the min is -1.5 and the max is 4.5. If you look at the quartiles they all have a difference of approximately 1 which leads me to believe that if we cast the column to integers instead of floats we can have approximetely 6 sections -1, 0, 1, 2, 3, and 4. This will make the analysis neater and more understandable.