# EDA (Exploratory Data Analysis)

- Exploratory Data Analysis (EDA) is an approach to analyzing data with visual and statistical techniques. It is used to discover trends and patterns and to check assumptions, with the help of statistical summaries and graphical representations.

**Steps involved in EDA**
1. Describing the data
2. Data cleaning
3. Imputation techniques
4. Data analysis and visualization
5. Transformations
6. Auto EDA

**Import data and data description**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10,5)
plt.rcParams['figure.dpi'] = 300
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/aishwaryamate/Machine-Learning/main/EDA-1/data_clean.csv', index_col=0)
df

Unnamed: 0,Ozone,Solar.R,Wind,Month,Day,Year,Temp,Weather
1,41.0,190.0,7.4,5,1,2010,67,S
2,36.0,118.0,8.0,5,2,2010,72,C
3,12.0,149.0,12.6,5,3,2010,74,PS
4,18.0,313.0,11.5,5,4,2010,62,S
5,,,14.3,5,5,2010,56,S
...,...,...,...,...,...,...,...,...
154,41.0,190.0,7.4,5,1,2010,67,C
155,30.0,193.0,6.9,9,26,2010,70,PS
156,,145.0,13.2,9,27,2010,77,S
157,14.0,191.0,14.3,9,28,2010,75,S


In [3]:
df.describe()

Unnamed: 0,Ozone,Solar.R,Wind,Day,Year,Temp
count,120.0,151.0,158.0,158.0,158.0,158.0
mean,41.583333,185.403974,9.957595,16.006329,2010.0,77.727848
std,32.620709,88.723103,3.511261,8.997166,0.0,9.377877
min,1.0,7.0,1.7,1.0,2010.0,56.0
25%,18.0,119.0,7.4,8.0,2010.0,72.0
50%,30.5,197.0,9.7,16.0,2010.0,78.5
75%,61.5,257.0,11.875,24.0,2010.0,84.0
max,168.0,334.0,20.7,31.0,2010.0,97.0


In [4]:
df.describe(include=object)

Unnamed: 0,Month,Weather
count,158,155
unique,6,3
top,9,S
freq,34,59


In [5]:
df.dtypes

Ozone      float64
Solar.R    float64
Wind       float64
Month       object
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 158 entries, 1 to 158
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Ozone    120 non-null    float64
 1   Solar.R  151 non-null    float64
 2   Wind     158 non-null    float64
 3   Month    158 non-null    object 
 4   Day      158 non-null    int64  
 5   Year     158 non-null    int64  
 6   Temp     158 non-null    int64  
 7   Weather  155 non-null    object 
dtypes: float64(3), int64(3), object(2)
memory usage: 11.1+ KB


- As the output above shows:
- The 'Month' column looks numeric, yet pandas reports its dtype as object.
- We need to find out why and then convert the column to a numeric data type.
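
One way to locate the entries that force a numeric-looking column to `object` is to coerce it with `pd.to_numeric(errors='coerce')` and look at whatever turns into `NaN`. The sketch below uses a small made-up Series standing in for `df['Month']`; the name `month` and its values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df['Month']: numeric strings plus one stray label.
month = pd.Series(['5', 'May', '6', '7'], name='Month')

# Entries that cannot be parsed as numbers become NaN.
coerced = pd.to_numeric(month, errors='coerce')

# Keep only the original entries that failed to parse.
bad_values = month[coerced.isna()].unique().tolist()
print(bad_values)
```

Whatever this prints is what needs to be replaced or corrected before `astype(int)` can succeed.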

# **Data type conversion**

In [7]:
df['Month']

1      5
2      5
3      5
4      5
5      5
      ..
154    5
155    9
156    9
157    9
158    9
Name: Month, Length: 158, dtype: object

In [8]:
df['Month'].unique()

array(['5', 'May', '6', '7', '8', '9'], dtype=object)

In [9]:
df['Month'].value_counts()

9      34
5      31
7      31
8      31
6      30
May     1
Name: Month, dtype: int64

In [12]:
# Assign the result back; inplace=True on a selected column is unreliable in newer pandas.
df['Month'] = df['Month'].replace('May', '5')

In [13]:
df['Month'].unique()

array(['5', '6', '7', '8', '9'], dtype=object)

In [14]:
df.dtypes

Ozone      float64
Solar.R    float64
Wind       float64
Month       object
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [21]:
df.head()

Unnamed: 0,Ozone,Solar.R,Wind,Month,Day,Year,Temp,Weather
1,41.0,190.0,7.4,5,1,2010,67,S
2,36.0,118.0,8.0,5,2,2010,72,C
3,12.0,149.0,12.6,5,3,2010,74,PS
4,18.0,313.0,11.5,5,4,2010,62,S
5,,,14.3,5,5,2010,56,S


In [19]:
df['Month'].astype(int)

1      5
2      5
3      5
4      5
5      5
      ..
154    5
155    9
156    9
157    9
158    9
Name: Month, Length: 158, dtype: int32

In [20]:
df.dtypes

Ozone      float64
Solar.R    float64
Wind       float64
Month       object
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [23]:
df['Month'] = df['Month'].astype(int)

In [24]:
df.dtypes

Ozone      float64
Solar.R    float64
Wind       float64
Month        int32
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [25]:
df.head()

Unnamed: 0,Ozone,Solar.R,Wind,Month,Day,Year,Temp,Weather
1,41.0,190.0,7.4,5,1,2010,67,S
2,36.0,118.0,8.0,5,2,2010,72,C
3,12.0,149.0,12.6,5,3,2010,74,PS
4,18.0,313.0,11.5,5,4,2010,62,S
5,,,14.3,5,5,2010,56,S


In [26]:
df['Wind'].replace([7.4,8.0],['A','B'])

1         A
2         B
3      12.6
4      11.5
5      14.3
       ... 
154       A
155     6.9
156    13.2
157    14.3
158       B
Name: Wind, Length: 158, dtype: object

In [27]:
df.head()

Unnamed: 0,Ozone,Solar.R,Wind,Month,Day,Year,Temp,Weather
1,41.0,190.0,7.4,5,1,2010,67,S
2,36.0,118.0,8.0,5,2,2010,72,C
3,12.0,149.0,12.6,5,3,2010,74,PS
4,18.0,313.0,11.5,5,4,2010,62,S
5,,,14.3,5,5,2010,56,S


# Duplicates

In [None]:
df

In [None]:
df.duplicated()

In [None]:
df.duplicated().sum()

In [None]:
#Print the duplicated values

In [None]:
df[df.duplicated()]

In [None]:
df.duplicated(keep = False)

In [None]:
df[df.duplicated(keep=False)]

In [None]:
#Drop Duplicated records

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

In [None]:
df.duplicated().sum()

# Drop columns

- One column, 'Year', contains only a single distinct value, so it carries no information.
- Dropping such unnecessary columns keeps the data from becoming needlessly complex.
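
A column with a single distinct value can also be found programmatically instead of by eye. This is a minimal sketch on a made-up frame; `toy` and `constant_cols` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame: 'Year' holds one value in every row.
toy = pd.DataFrame({'Year': [2010, 2010, 2010], 'Temp': [67, 72, 74]})

# nunique() counts the distinct non-null values in a column.
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_cols)

# Drop every constant column in one call.
toy = toy.drop(columns=constant_cols)
```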

In [None]:
df.head()

In [None]:
df.drop(columns=['Year'], inplace=True)

In [None]:
df.head()

# Rename the columns

In [None]:
df.head()

In [None]:
df.rename(columns={'Solar.R':'Solar','Temp':'Temperature'},inplace=True)

In [None]:
df

# Missing value imputation

- In a dataset, missing data (missing values) occur when no value is stored for a variable in an observation.
- Missing data are common and can significantly affect the conclusions drawn from the data.
- If we don't impute or otherwise handle null values, many machine learning models cannot be trained, because most estimators do not accept missing values.
- Handling missing values is a crucial step in EDA.
- Missing values can appear for several reasons, such as:
    - Incomplete data entry
    - Issues with machines
    - Improper handling of data
    - And many more.
         

In [None]:
#Checking null values

In [None]:
df.head()

In [None]:
df.isna() # same as df.isnull()

In [None]:
df.isna().sum()

In [None]:
df.isnull().sum()

In [None]:
#Visualizing missing values

In [None]:
sns.heatmap(df.isna())

In [None]:
df.isna().sum()

In [None]:
#Calculate the percentage of missing values in each column.
for col, n_missing in df.isna().sum().items():
    print(col, (n_missing/len(df))*100)

# Rule of thumb for missing value imputation:
1. If NA values make up 1 to 5% of a column, drop the affected rows (dropna).
2. If NA values make up 5 to 40% of a column, fill them in (fillna).
3. If NA values make up more than 50% of a column, drop that entire column.
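
The thresholds above can be turned into a small helper that suggests an action per column. This is only a sketch: `suggest_na_action` is a hypothetical function, the toy frame is invented, and treating anything above 0% and up to 5% as "drop rows" is an assumption. The 40 to 50% band is not covered by the rule, so it is labelled as a judgment call.

```python
import numpy as np
import pandas as pd

def suggest_na_action(frame):
    """Map each column to a suggested action based on its share of NaN."""
    pct = frame.isna().mean() * 100  # percentage of missing values per column
    actions = {}
    for col, p in pct.items():
        if p == 0:
            actions[col] = 'nothing to do'
        elif p <= 5:
            actions[col] = 'drop rows'
        elif p <= 40:
            actions[col] = 'fill values'
        elif p > 50:
            actions[col] = 'drop column'
        else:
            actions[col] = 'judgment call'  # 40-50% is not covered by the rule
    return actions

# Hypothetical 50-row frame with 0%, 2%, 20% and 60% missing values.
toy = pd.DataFrame({
    'a': list(range(50)),
    'b': [np.nan] * 1 + [1.0] * 49,
    'c': [np.nan] * 10 + [1.0] * 40,
    'd': [np.nan] * 30 + [1.0] * 20,
})
actions = suggest_na_action(toy)
print(actions)
```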

In [None]:
df.head()

In [None]:
df.hist()
plt.tight_layout()

In [None]:
ozone_median = df['Ozone'].median()
ozone_median

In [None]:
df['Ozone'] = df['Ozone'].fillna(ozone_median)

In [None]:
df.isna().sum()

In [None]:
df['Solar'].mean()

In [None]:
df['Solar'] = df['Solar'].fillna(df['Solar'].mean())

In [None]:
df.isna().sum()

In [None]:
df['Weather'].value_counts()

In [None]:
df['Weather'].mode()

In [None]:
df['Weather'].mode()[0]

In [None]:
df['Weather'] = df['Weather'].fillna(df['Weather'].mode()[0])

In [None]:
df.isna().sum()

# Outlier detection

- There are multiple approaches to detect outliers in the dataset.
    - Histogram
    - Boxplot
    - Descriptive stats for df
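
A boxplot conventionally flags points that lie more than 1.5 × IQR beyond the quartiles. The sketch below applies that convention by hand to a made-up Series (the values and the names `values`, `outliers`, and `capped` are illustrative assumptions) and shows `Series.clip` as one possible way to cap the flagged points.

```python
import pandas as pd

# Hypothetical toy column with one extreme reading.
values = pd.Series([10, 12, 11, 13, 12, 11, 90], name='Ozone')

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are treated as outliers.
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())

# Capping: pull every value back inside the fences.
capped = values.clip(lower=lower, upper=upper)
```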

In [None]:
df.describe()

In [None]:
df.hist()
plt.tight_layout()

In [None]:
df.boxplot()

In [None]:
sns.boxplot(x = df['Ozone'])

In [None]:
#Outlier detection function

In [None]:
def outlier_detection(data,colname):
    q1 = data[colname].quantile(0.25)
    q3 = data[colname].quantile(0.75)
    iqr = q3-q1
    
    upper = q3+(1.5*iqr)
    lower = q1-(1.5*iqr)
    
    return lower,upper

In [None]:
outlier_detection(df, 'Ozone')

In [None]:
# Quantiles are only defined for numeric columns, so skip 'Weather'.
outlier_detection(df, df.select_dtypes('number').columns)

In [None]:
df[df['Ozone']>81.0]

In [None]:
len(df[df['Ozone']>81.0])

In [None]:
df.loc[df['Ozone'] > 81.0,'Ozone']

In [None]:
#Capping Outliers

In [None]:
df.loc[df['Ozone']>81.0,'Ozone'] = 81.0

In [None]:
sns.boxplot(x = df['Ozone'])

In [None]:
outlier_detection(df, 'Wind')

In [None]:
sns.boxplot(x = df['Wind'])

In [None]:
df[df['Wind'] > 17.65]

In [None]:
# Selecting only normal data points 
# data = df[df['Wind'] < 17.65]
# data 

In [None]:
df.loc[df['Wind']>17.65,'Wind'] = 17.65

In [None]:
df.boxplot()

# Scatter Plot and Correlation

In [None]:
sns.pairplot(df)

In [None]:
#Correlation coefficient

In [None]:
df.corr(numeric_only=True)

In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='viridis')

# Transformations

- Machine learning models cannot work directly with words and sentences.
- They operate on numbers.
- Before model building, we have to convert all categorical columns into numerical ones.
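
To make the idea concrete, here is a minimal sketch of one-hot encoding on a made-up frame. The frame `toy` is hypothetical, and `dtype=int` is an optional choice here so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({'Temp': [67, 72, 74], 'Weather': ['S', 'C', 'PS']})

# Each category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=['Weather'], dtype=int)
print(encoded.columns.tolist())
```

Each row has a 1 in the indicator column of its original category and 0 in the others.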

In [None]:
df.head()

In [None]:
#Encoding using the pandas get_dummies function.

In [None]:
df = pd.get_dummies(data=df,columns= ['Weather'])

In [None]:
df

# **Scaling the data**

- Normalization
    - Scales values into the range 0 to 1.
- Standardization
    - Uses the z-score for scaling.
    - Scales values so that the mean is 0 and the standard deviation is 1.
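
Both definitions reduce to one-line formulas: normalization is `(x - min) / (max - min)` and standardization is `(x - mean) / std`. The sketch below applies them with NumPy to a made-up array; the array `x` is an illustrative assumption. It uses the population standard deviation (`ddof=0`), which is NumPy's default and also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy feature.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max scaling): values end up between 0 and 1.
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1.
standardized = (x - x.mean()) / x.std()  # x.std() uses ddof=0 by default

print(normalized)
print(standardized)
```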

In [None]:
from sklearn.preprocessing import StandardScaler,MinMaxScaler

In [None]:
#Standardization
sc = StandardScaler()

In [None]:
df.describe()

In [None]:
df.head(2)

In [None]:
sc.fit_transform(df)

In [None]:
df.columns

In [None]:
scaled_data = pd.DataFrame(sc.fit_transform(df), columns=df.columns)
scaled_data

In [None]:
scaled_data.describe()

In [None]:
#MinMaxScaler

In [None]:
mn = MinMaxScaler()

In [None]:
df.columns

In [None]:
minmax_scaled = pd.DataFrame(mn.fit_transform(df), columns=df.columns)
minmax_scaled

In [None]:
minmax_scaled.describe()