In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# What is EDA?
Exploratory Data Analysis (EDA) is an approach to analyzing data with summary statistics and visual techniques. It is used to discover trends and patterns and to check assumptions before any modelling is done.

EDA is used to gain insight into our dataset.

# Libraries required for basic EDA:
* pandas
* seaborn
* matplotlib (used below to size and arrange the plots)

<a id='toc'></a>
# Basic EDA commands every beginner should know:
1. [shape](#shape)
2. [head(n)](#head)
3. [info()](#info)
4. [describe()](#describe)
5. [columns](#columns)
6. [value_counts()](#value)
7. [isnull().sum()](#isnull)
8. [duplicated().sum()](#duplicated)
9. [nunique()](#nunique)
10. [boxplot()](#boxplot)
11. [corr() and sns.heatmap()](#heatmap)
12. [Check null values in the DataFrame with a heatmap](#nullheatmap)
13. [Check survival by gender with countplot](#GenderSurvivedcountplot)
14. [Count Pclass with countplot](#PclassCountplot)
15. [Boxplot of Age by Pclass](#boxplotPclassAge)
16. [Countplot of gender by Pclass](#countplotGenderPclass)
17. [Random Forest Classifier Predictions](#RFCP)

In [None]:
#Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier


# Loading Dataset
I will be using these EDA commands on the 'Titanic - Machine Learning from Disaster' dataset.

(To load the dataset, click Add data on the right side, search for 'Titanic - Machine Learning from Disaster', and click Add.)

In [None]:
# Load the data (the same absolute Kaggle input path is used for every file)
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
df_gender_submission = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

<a id='shape'></a>
# 1. shape [**↑**](#toc)  

Returns the dimensions of a DataFrame as a tuple: (number of rows, number of columns).

In [None]:
train_data.shape

So, our dataset contains 891 rows (examples) and 12 columns.
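Because `shape` is a plain tuple, it can be unpacked into separate row and column counts. A minimal sketch on a made-up frame (the `toy` name and its values are illustrative only, not Titanic data):

```python
import pandas as pd

# Toy frame for illustration only; not the Titanic data.
toy = pd.DataFrame({"Age": [22, 38, 26], "Fare": [7.25, 71.28, 7.92]})

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```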

<a id="head"></a>
# 2. head(n) [**↑**](#toc)  
Returns the first n rows of the DataFrame (default n = 5).

In [None]:
train_data.head()

In [None]:
train_data.head(10)

<a id="info"></a>
# 3. info() [**↑**](#toc)  

Prints a concise summary of the dataset: column names, non-null counts, data types and memory usage.


In [None]:
train_data.info()

From info, we can see that there are some values missing in 'Age', 'Cabin' and 'Embarked'.

<a id="describe"></a>
# 4. describe() [**↑**](#toc)  
Returns summary statistics for the data in the DataFrame.

For numerical data, the description contains the following information for each column:

* count - The number of non-missing values.

* mean - The average (mean) value.

* std - The standard deviation.

* min - The minimum value.

* 25% - The 25th percentile (a quarter of the values fall at or below it).

* 50% - The 50th percentile (the median).

* 75% - The 75th percentile.

* max - The maximum value.

In [None]:
train_data.describe()
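By default `describe()` summarizes only the numeric columns and skips missing values when counting. The sketch below, on a small made-up frame (the `toy` name and values are assumptions for illustration), shows that behaviour and how `include="all"` also brings in text columns:

```python
import pandas as pd

# Toy frame for illustration only: one numeric column with a missing value,
# one text column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

numeric = toy.describe()                  # numeric columns only
everything = toy.describe(include="all")  # adds unique/top/freq for text columns

print(numeric)
print(everything)
```

Here `count` for `Age` is 3 rather than 4 because the missing value is ignored, and the `50%` row is the median of the remaining ages.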

<a id="columns"></a>
# 5. columns [**↑**](#toc)  
Returns the column labels of the DataFrame.

In [None]:
train_data.columns

<a id="value"></a>
# 6. value_counts() [**↑**](#toc)  

Return a Series containing counts of unique values.

In [None]:
train_data['Survived'].value_counts()

In [None]:
train_data["Embarked"].value_counts()
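Raw counts are often easier to read as proportions, and by default `value_counts()` leaves missing values out. A small sketch with a made-up Series (the `ports` name and values are illustrative assumptions) shows the `normalize` and `dropna` options:

```python
import pandas as pd

# Toy Series for illustration only; one entry is missing.
ports = pd.Series(["S", "C", "S", None, "Q", "S"])

counts = ports.value_counts()                    # missing value is skipped
shares = ports.value_counts(normalize=True)      # proportions instead of counts
with_missing = ports.value_counts(dropna=False)  # missing value gets its own row

print(counts)
print(shares)
print(with_missing)
```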

<a id="isnull"></a>
# 7. isnull().sum() [**↑**](#toc)  

Use this to count the null values in each column.

In [None]:
train_data.isnull().sum()
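The absolute count is harder to judge than the share of a column that is missing. Since `isnull()` yields booleans, taking the mean gives that share directly. A minimal sketch on a made-up frame (names and values are illustrative assumptions):

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Mean of a boolean column = fraction of True values; scale to percent
missing_pct = toy.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```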

<a id="duplicated"></a>
# 8. duplicated().sum() [**↑**](#toc)  

Returns number of duplicate rows in the dataset.

In [None]:
train_data.duplicated().sum()
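Note that `duplicated()` marks only the second and later occurrences of a row, so the first copy is not counted. A short sketch on a made-up frame (the `toy` name and values are assumptions) shows this, together with `drop_duplicates()` for removing the repeats:

```python
import pandas as pd

# Toy frame for illustration only: rows 0 and 2 are identical.
toy = pd.DataFrame({"Name": ["A", "B", "A"], "Age": [30, 25, 30]})

n_dup = toy.duplicated().sum()                 # repeats only, first copy not counted
n_involved = toy.duplicated(keep=False).sum()  # every row that has a twin
deduped = toy.drop_duplicates()                # keeps the first copy of each row

print(n_dup, n_involved, len(deduped))
```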

<a id="nunique"></a>
# 9. nunique() [**↑**](#toc)  
Returns the number of unique values in each column.


In [None]:
train_data.nunique()

In [None]:
train_data["Survived"].nunique()
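By default `nunique()` does not count missing values as a distinct value. The sketch below uses a made-up Series (illustrative only) to show the difference that `dropna=False` makes:

```python
import pandas as pd

# Toy Series for illustration only; one entry is missing.
ports = pd.Series(["S", "C", None, "S"])

n_default = ports.nunique()                  # missing value ignored
n_with_missing = ports.nunique(dropna=False)  # missing value counted as its own category

print(n_default, n_with_missing)
```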

<a id="boxplot"></a>
# 10. boxplot() [**↑**](#toc)  

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable.

It is useful for finding outliers in the dataset.


In [None]:
train_data[["Age"]].boxplot()


As seen in the boxplot, the "Age" variable has some outliers between roughly 65 and 80 (the circles drawn beyond the upper whisker).


In [None]:
train_data[["Pclass"]].boxplot()

We cannot see any points beyond the whiskers here, so 'Pclass' does not contain any outliers.
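The points a boxplot draws as outliers follow a simple rule: with matplotlib's default settings, anything more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile is plotted individually. A minimal sketch that applies the same rule by hand to made-up ages (the values are illustrative, not taken from the Titanic data):

```python
import pandas as pd

# Toy ages for illustration only.
ages = pd.Series([2, 22, 24, 26, 28, 30, 32, 34, 80])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Default whisker limits: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(f"limits: [{lower}, {upper}]")
print(outliers.tolist())
```

The same filter can be applied to a real column after dropping its missing values.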

<a id="heatmap"></a>
# 11. corr() and sns.heatmap() [**↑**](#toc)  

.corr() computes the pairwise correlation (Pearson by default) between the numeric columns of the DataFrame.
The correlation value ranges from -1 to 1.

-1 : perfect negative correlation

0 : no linear correlation

1 : perfect positive correlation

Reasons why you would remove highly correlated features:
* Make the learning algorithm faster
* Avoid redundant information, which can otherwise bias a model toward the duplicated signal

sns.heatmap() is used here to visualize the output of .corr().


In [None]:
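# numeric_only=True skips text columns such as 'Name' and 'Sex';
# recent pandas versions raise an error without it
train_data.corr(numeric_only=True)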

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))  # create a larger figure before plotting
sns.heatmap(train_data.corr(numeric_only=True), annot=True, ax=ax)

Here, most pairs of variables show almost no correlation. The exceptions are 'Pclass'-'Fare' (-0.55), 'Age'-'Pclass' (-0.37) and 'SibSp'-'Age' (-0.31), which are weakly to moderately negatively correlated.
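To make the numbers in the heatmap less of a black box, the Pearson coefficient can be computed by hand and compared with what `.corr()` returns. This is a sketch on a made-up frame (the `toy` values are illustrative assumptions, not the Titanic data):

```python
import pandas as pd

# Toy frame for illustration only; 'Sex' is text and is skipped by numeric_only=True.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0],
    "Sex": ["f", "m", "f", "m", "m"],
})

corr = toy.corr(numeric_only=True)

# Pearson r by hand: co-deviation divided by the product of the spreads
dx = toy["Pclass"] - toy["Pclass"].mean()
dy = toy["Fare"] - toy["Fare"].mean()
manual_r = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(corr)
print(manual_r)
```

In this toy data, higher class numbers go with lower fares, so the coefficient is negative, the same direction as the 'Pclass'-'Fare' value reported above.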

<a id='nullheatmap'></a>
# 12. Check null values in the DataFrame with a heatmap [**↑**](#toc)

In [None]:
sns.heatmap(train_data.isnull())

<a id='GenderSurvivedcountplot'></a>
# 13. Check survival by gender with countplot [**↑**](#toc)

In [None]:
sns.countplot(x='Sex',hue='Survived',data=train_data)

<a id='PclassCountplot'></a>
# 14. Count Pclass with countplot [**↑**](#toc)


In [None]:
sns.countplot(x='Pclass',data=train_data)

<a id='boxplotPclassAge'></a>
# 15. Boxplot of Age by Pclass [**↑**](#toc)

In [None]:
sns.boxplot(x='Pclass',y='Age',data=train_data)

<a id='countplotGenderPclass'></a>
# 16. Countplot of gender by Pclass [**↑**](#toc)

In [None]:
sns.countplot(x='Sex',hue='Pclass',data=train_data)

In [None]:
sns.pairplot(train_data, hue="Survived")

In [None]:
fig , axes = plt.subplots(nrows=1, ncols=3, figsize=(18,6))
sns.countplot(x = 'Survived', hue = 'Sex', data= train_data, ax = axes[0])
sns.countplot(x = 'Survived', hue = 'Pclass', data= train_data, ax = axes[1])
sns.countplot(x = 'Survived', hue = 'Embarked', data= train_data, ax = axes[2])

In [None]:
fig, ax = plt.subplots(figsize=(22, 10))
sns.scatterplot(x='Age', y='SibSp', data=train_data, ax=ax)


In [None]:
g = sns.FacetGrid(train_data, col='Survived')
g.map(plt.hist, 'Age', bins=20)

<a id="RFCP"></a>
# 17. Random Forest Classifier Predictions [**↑**](#toc) 

In [None]:
# Survival rate by sex. 'Survived' is 0/1, so its mean is the fraction who survived.
women = train_data.loc[train_data.Sex == 'female', "Survived"]
rate_women = women.mean()

print("Fraction of women who survived:", rate_women)  # 0.7420382165605095

men = train_data.loc[train_data.Sex == 'male', "Survived"]
rate_men = men.mean()

print("Fraction of men who survived:", rate_men)  # 0.18890814558058924


y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
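One thing to watch when `pd.get_dummies` is called separately on the training and test frames: if a category appears in only one of them, the two encoded frames end up with different columns and `predict` will fail. With the features used above this does not happen, but as a general precaution the test columns can be aligned to the training columns. A minimal sketch on made-up frames (the `train`/`test` names, the 'Embarked' feature and all values are illustrative assumptions, not part of the model above):

```python
import pandas as pd

# Toy frames for illustration only: the test set lacks ports 'C' and 'Q'.
train = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "S"]})

X = pd.get_dummies(train)
X_test = pd.get_dummies(test)  # only gets an Embarked_S column

# Give the test frame exactly the training columns, in the same order,
# filling dummy columns that never occurred in the test data with 0
X_test_aligned = X_test.reindex(columns=X.columns, fill_value=0)

print(list(X.columns))
print(list(X_test.columns))
print(list(X_test_aligned.columns))
```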

### Happy learning😃