# Data Analysis

Data analysis is the step in the data science cycle that helps you gain insight into the data before doing any machine learning.  

It helps extract more knowledge about our features (columns) and, further on, formulate the business problem. For example:  
 Which feature is the most relevant or irrelevant to the target (output)?  
 What kind of analysis should we go for (predictive, regressive)?  
 How should the ```target``` be segmented?  
 What can we say about the data's reliability? Seasonality?

In [None]:
import os
import pandas as pd 

# Viz libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
path = "/content/drive/MyDrive/Colab_Notebooks/orness/clean_data/"

In [None]:
df = pd.read_csv(path+"winequality.csv")

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
# Check that there are no null values (count of nulls per column)
df.isnull().sum()

In [None]:
# Get the dtype and non-null count of every column at once
df.info()

## Basic terminologies

**Mean (Average)**: Perhaps the most familiar one of all. Add up all the sample values for a given feature, then divide by the number of samples.  
eg: mean of [1,2,3,4,5] = (1+2+3+4+5)/5 = 3

**Median**: Arrange all the sample values in numerical order; the middle value in this list is the median. With an even number of samples, take the average of the two middle values.  
eg: median of [1,2,3,4,5] = 3  

**Mode**: The value that occurs most often in a list of samples.  
eg: mode of [1,2,2,3] = 2  

**Range**: The difference between the highest and the lowest value in a list.  
eg: range of [1,2,3,4,5] = 5-1 = 4
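
The four definitions above can be checked with a few lines of Python. This is a minimal sketch using the standard-library ```statistics``` module; the ```samples``` list is made up for illustration and is not taken from the wine data.

```python
import statistics

# Hypothetical sample values (even count, so the median is an average of two values)
samples = [1, 2, 2, 3, 4, 6]

mean_value = statistics.mean(samples)      # (1+2+2+3+4+6) / 6
median_value = statistics.median(samples)  # average of the two middle values, 2 and 3
mode_value = statistics.mode(samples)      # 2 occurs twice, more than any other value
range_value = max(samples) - min(samples)  # highest value minus lowest value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("range:", range_value)
```

On a dataframe column, pandas offers the same measures through ```mean()```, ```median()``` and ```mode()```.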

**Distribution**: Describes how the values of a field are spread out. In other words, the statistical distribution shows which values are common and which are uncommon.  

![distribution](https://drive.google.com/uc?id=126rLjawDjArg8HV1uHvmhnHrss74QiXD)

**Correlation**: A statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate).  

Correlation values range over [-1, 1]. A value near 0 means there is no linear relationship.  

x = y (perfectly positively correlated, correlation = 1)  
x = -y (perfectly negatively correlated, correlation = -1)

![correlation](https://drive.google.com/uc?id=14P4gutzW696ZDwHBR1W_yL7EBcUZT1sw)
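
To make the two extreme cases concrete, here is a small sketch using the pandas ```Series.corr``` method, which computes the Pearson correlation by default. The series are toy values chosen for illustration, not wine features.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])  # hypothetical values

r_same = x.corr(x)              # x = y: perfect positive correlation
r_opposite = x.corr(-x)         # x = -y: perfect negative correlation
r_scaled = x.corr(2 * x + 10)   # any increasing straight line is still perfectly correlated

# Correlation only captures linear relations: y = x**2 depends fully on x,
# yet on symmetric data the linear correlation comes out at (about) zero.
sym = pd.Series([-2, -1, 0, 1, 2])
r_nonlinear = sym.corr(sym ** 2)

print(r_same, r_opposite, r_scaled, r_nonlinear)
```

The last case is worth remembering when reading a correlation matrix: a value close to 0 rules out a linear relation, not every relation.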

In [None]:
# performing the value counts of the target column
df['TARGET'].value_counts()

Let's look at the target distribution of our wine.  

We can use the ```hist``` function on the dataframe to plot the distribution of the target.

In [None]:
df.hist(column='TARGET')
# How do we know which function to apply?

Target segmentation:  
```greater than 5 is good``` and ```5 or less is bad``` quality wine -- EASY  
```3 & 4 is bad```, ```5 & 6 is average``` and ```7 & 8 is good``` quality wine -- AVERAGE  
```3 & 4 is bad```, ```5 is average```, ```6 & 7 is good``` and ```8 is the best``` quality wine -- HARD
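
The three schemes above could be expressed with ```pd.cut```, which assigns each score to a labelled bin. This is a sketch on a hypothetical series of scores from 3 to 8; the bin edges and label names are my own choice, and a score of exactly 5 is assumed to count as bad in the EASY scheme, matching the ```y > 5``` rule used later in the notebook.

```python
import pandas as pd

# Hypothetical quality scores covering every value mentioned in the schemes
quality = pd.Series([3, 4, 5, 6, 7, 8], name="TARGET")

# pd.cut uses right-closed bins by default: (2, 4] covers the scores 3 and 4, and so on.
easy = pd.cut(quality, bins=[2, 5, 8], labels=["bad", "good"])
average = pd.cut(quality, bins=[2, 4, 6, 8], labels=["bad", "average", "good"])
hard = pd.cut(quality, bins=[2, 4, 5, 7, 8], labels=["bad", "average", "good", "best"])

# Side-by-side view of the score and its label under each scheme
segments = pd.DataFrame({"TARGET": quality, "easy": easy, "average": average, "hard": hard})
print(segments)
```

The more classes a scheme has, the fewer samples fall into each one, which is part of what makes the later schemes harder.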


In [None]:
df.corr()

One of a data analyst's key skills is domain expertise, i.e. knowing exactly what each feature means.

## Features

Looking at all the column names and what they mean:  
**Fixed acidity**: Non-volatile acids in wine (acids that do not evaporate readily)  
**Volatile acidity**: The amount of acetic acid in wine  
**Citric acid**: Adds flavor to wine (found in small quantities)  
**Residual sugar**: Sugar remaining after the fermentation process  
**Chlorides**: Amount of salt in the wine  
**Free sulfur dioxide**: The free form of SO2 (it prevents microbial growth and the oxidation of wine)  
**Total sulfur dioxide**: Amount of free and bound forms of SO2  
**Density**: The density of the wine  
**pH**: Describes how acidic or basic the wine is on a scale from 0 to 14  
**Sulfates**: A wine additive contributing to SO2 levels  
**Alcohol**: Percentage of alcohol content  
**Quality**: Wine rating on a scale of 0 to 10 

## Correlation

In [None]:
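# Pairwise correlation between the columns
correlation = df.corr()
plt.figure(figsize=(14, 12))
# Annotated heatmap: red cells are positive correlations, blue cells negative
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, cmap="RdBu_r")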
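
A heatmap is easy to scan, but when the question is which features matter most for the target, it can help to pull out the ```TARGET``` column of the correlation matrix and sort it by absolute value. The sketch below uses a small made-up dataframe; the values are invented for illustration and the column names ```alcohol``` and ```volatile_acidity``` are assumed, so the ranking says nothing about the real wine data.

```python
import pandas as pd

# Toy data, NOT taken from winequality.csv
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile_acidity": [0.80, 0.70, 0.60, 0.50, 0.40, 0.30],
    "residual_sugar": [2.0, 1.9, 2.1, 2.0, 1.9, 2.1],
    "TARGET": [3, 4, 5, 6, 7, 8],
})

# Correlation of every feature with the target (the target itself is dropped)
target_corr = toy.corr()["TARGET"].drop("TARGET")

# Order by strength of the relation, keeping the sign for interpretation
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative features near the top, since a strong negative relation is just as informative as a strong positive one.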

Let's look at each feature and see its relation to the wine quality.  

For that we'd have to plot each feature against the ```TARGET```.

Let's look at the box plot before we get any further:  

![whisker_plot](https://drive.google.com/uc?id=12alHShTVMI4o2-fgN3fWQKl8rYrmQCo7)
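
The numbers behind a box plot can also be computed by hand. The sketch below assumes the common convention, which is also the seaborn and matplotlib default, that whiskers extend at most 1.5 times the interquartile range beyond the box and that points further out are drawn as outliers. The values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements; 30 is deliberately far from the rest
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)      # lower edge of the box
median = values.quantile(0.50)  # line inside the box
q3 = values.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                   # interquartile range = height of the box

# Limits beyond which a point is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

When comparing boxes across quality scores, a feature whose median shifts steadily from one score to the next is a good candidate for a relevant feature.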

## Feature Relevancy

In [None]:
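# All columns except the last one (this assumes TARGET is the last column)
cols = df.columns[:-1]
for col in cols:
  # One box per quality score, showing how the feature is spread at that score
  sns.boxplot(x='TARGET', y=col, data=df, palette='coolwarm')
  plt.show()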

In [None]:
# An irrelevant feature could be dropped using drop
df_drop = df.drop('residual_sugar', axis=1)

In [None]:
# converting to binary classification problem
Y = df['TARGET'].apply(lambda y: 1 if y > 5 else 0)

In [None]:
#value counts on Y?
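
One way to approach the question above is to count how many samples end up in each class after the binary conversion. The sketch below applies the same ```y > 5``` rule to a made-up series of scores, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical quality scores standing in for df['TARGET']
target = pd.Series([3, 5, 5, 6, 6, 6, 7, 8], name="TARGET")

# Same rule as in the notebook: 1 = good (score above 5), 0 = bad
Y = target.apply(lambda y: 1 if y > 5 else 0)

counts = Y.value_counts()                # absolute number of samples per class
shares = Y.value_counts(normalize=True)  # fraction of samples per class

print(counts)
print(shares)
```

If one class heavily outnumbers the other, accuracy alone becomes a misleading metric, so this check is worth doing before training a classifier.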