# Pandas

***

### Why Pandas

Pandas is one of the most powerful data manipulation tools available. Once a data scientist learns to leverage its indexing, it becomes very hard to beat for data manipulation!

### DataFrame Basics

The *DataFrame* is the main object in Pandas. What’s cool about Pandas is that it takes data (like a CSV or JSON file, or a SQL database) and creates a Python object with **rows** and **columns**. A DataFrame represents tabular data, much like an Excel spreadsheet.



In [None]:
from IPython.display import Image 

Image("EDA.png")

***

## 6 Parts of Pandas
1. Importing Data and Reading Data
2. Summarizing Data (Statistics)  
3. Manipulating Data / Cleaning Data
4. Selecting Data / Subsetting Data
5. Grouping and Filtering Data
6. Combining Datasets

# Getting Started

## Import Libraries


**Pandas:** Used for data manipulation and data analysis.
<br>
**NumPy:** Fundamental package for scientific computing with Python.
<br>
**Matplotlib and Seaborn:** Used for plotting and visualization.
<br>
**Scikit-learn:** Used for data preprocessing techniques and algorithms.

In [None]:
# Importing required Packages
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
from IPython.display import Image 


## Pandas Data Structure<br>


<li>Series
<li>DataFrame


### What is a Series

In [None]:
## Anand,Mukesh,Ramesh,Shiva

age_np = np.array([23,56,67,89])
age_np

In [None]:
age_np[0]

In [None]:
names = ["Anand","Mukesh","Ramesh","Shiva"]
age_series = pd.Series(data=age_np,index=names)
age_series

In [None]:
age_series["Mukesh"]

### Create DataFrames

In [None]:
english = [34,56,78,66]
S1 = pd.Series(data=english, index=names)
S1

In [None]:
maths = [76,45,97,56]
S2 = pd.Series(data=maths,index=names)
S2

In [None]:
df = pd.DataFrame({"English_marks": S1, "Maths_marks": S2})
df

In [None]:
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]} 
df = pd.DataFrame(data) 

In [None]:
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 

In [None]:
data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}] 
df = pd.DataFrame(data)

In [None]:
Name = ['tom', 'krish', 'nick', 'juli']  
Age = [25, 30, 26, 22]  
list_of_tuples = list(zip(Name, Age))  
df = pd.DataFrame(list_of_tuples, columns = ['Name', 'Age'])  

In [None]:
df =  pd.DataFrame({"col1" : [1,2,3],"col2":[5,6,7]})

## Importing Data


In [None]:
Image("files_read.png")

### Import .csv files from a local machine

Use the file path: file_path = "/home//Desktop/Project/"

In [None]:
# Read Loan Dataset
df = pd.read_csv("train.csv")
df.head()

In [None]:
df.tail()

### Importing Files from a web url

In [None]:
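import ssl

# Caution: this disables HTTPS certificate verification for the whole session.
# Use it only as a workaround, and only for sources you trust.
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/HR_comma_sep.csv.txt'
df_hr = pd.read_csv(url)
df_hr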

### Read data from an Excel file

In [None]:
df_sales = pd.read_excel('data-science-complete-tutorial/Data/sales_info.xlsx')

df_sales

### Exploring pd.read_csv()

Read_csv parameters: https://honingds.com/blog/pandas-read_csv/

#### <b> sep = "," by default  :</b>  

Specify the separator if it is not a comma.

In [None]:
Image("sep.png")

 

Use sep = "|" as shown below

In [None]:
Image("sep1.png")
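
The same idea can be tried without a file on disk. This is a minimal sketch, assuming a small made-up pipe-delimited text (the names and ages are illustrative, not from the loan dataset) that is read through io.StringIO:

```python
import io
import pandas as pd

# Hypothetical pipe-delimited data standing in for a file on disk.
raw_text = "Name|Age\nTom|20\nNick|21\n"

# sep tells read_csv which character separates the fields.
df_pipe = pd.read_csv(io.StringIO(raw_text), sep="|")
print(df_pipe)
```

Without sep="|", each line would be read as a single column, because the default separator is a comma.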

#### <b> header :</b>  

Use the header parameter of read_csv to specify which line of your data should be treated as the header.

In [None]:
Image("data_header.png")

In [None]:
Image("header_0.png")

**header = 1 treats the second line of the dataset as the header.**

In [None]:
Image("header_1.png")
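
A small sketch of the header parameter, assuming a made-up text whose first line is a title rather than the column names:

```python
import io
import pandas as pd

# Hypothetical data: line 0 is a title, line 1 holds the real column names.
raw_text = "Loan report\nName,Age\nTom,20\nNick,21\n"

# header=0 (the default) would treat the title line as the header.
# header=1 uses the second line instead; lines above it are skipped.
df_header = pd.read_csv(io.StringIO(raw_text), header=1)
print(df_header.columns.tolist())
print(df_header)
```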

#### <b> index_col :</b>  

Use this argument to specify the row labels to use. If you set index_col to 0, then the first column of the dataframe will become the row label.

In [None]:
Image("index_col.png")
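
As an illustration of index_col, the sketch below uses column names from the loan dataset with made-up values, and turns the first column into the row labels:

```python
import io
import pandas as pd

# Hypothetical rows; only the column names come from the loan dataset.
raw_text = "Loan_ID,Gender,LoanAmount\nLP001,Male,120\nLP002,Female,66\n"

# index_col=0 makes the first column (Loan_ID) the row label.
df_indexed = pd.read_csv(io.StringIO(raw_text), index_col=0)

# Rows can now be looked up by their Loan_ID label.
print(df_indexed.loc["LP002", "LoanAmount"])
```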

#### <b> usecols :</b>  

Use usecols when you want to load only specific columns into a dataframe. It is especially useful when the input dataset has a large number of columns and you need only a subset of them.

In [None]:
Image("usecols.png")
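
A minimal usecols sketch, again with made-up rows under loan-dataset column names:

```python
import io
import pandas as pd

# Hypothetical rows with four columns.
raw_text = (
    "Loan_ID,Gender,Married,LoanAmount\n"
    "LP001,Male,No,120\n"
    "LP002,Female,Yes,66\n"
)

# Only the listed columns are parsed and loaded.
df_subset = pd.read_csv(io.StringIO(raw_text), usecols=["Loan_ID", "LoanAmount"])
print(df_subset)
```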

#### <b> nrows :</b>  

If you want to read a limited number of rows, instead of all the rows in a dataset, use nrows. This is especially useful when reading a large file into a pandas dataframe.

In [None]:
Image("nrows.png")
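
To see nrows in action, the sketch below generates a made-up 1000-row text in memory and reads only the first five data rows:

```python
import io
import pandas as pd

# Hypothetical data: a header line followed by 1000 generated rows.
raw_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1000))

# nrows limits how many data rows are read (the header is not counted).
df_first = pd.read_csv(io.StringIO(raw_text), nrows=5)
print(df_first)
```

Reading a few rows first is a cheap way to check column names and types before loading a large file in full.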

In [None]:
# Standard practice: keep an untouched copy of the original data
data_original = df.copy()

# HIGH LEVEL DATA UNDERSTANDING

### Functions

***
These functions are the most common tools used when trying to summarize your data

- **df.head(n)** — Returns the first n rows of your DataFrame. Having a blank argument will display the first 5 by default
- **df.tail(n)** — Returns the last n rows of your DataFrame. Having a blank argument will display the last 5 by default
- **df.shape** — Returns the number of rows and columns in your DataFrame as a tuple (an attribute, so no parentheses)
- **df.describe()** — Displays a statistical summary for numerical columns
- **df.describe(include=['object'])** —  Displays a statistical summary for all object (string) columns
- **df.describe(include='all')**  —  Displays a statistical summary for all columns
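- **df.mean()** — Returns the mean of each numeric column (pass numeric_only=True if the DataFrame also contains non-numeric columns)
- **df.median()** — Returns the median of each numeric column
- **df.std()** — Returns the standard deviation of each numeric column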
- **df.max()** — Returns the highest value in each column
- **df.min()** — Returns the lowest value in each column
- **df.dtypes** - Returns the data type of each column
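
The functions listed above can be tried on a tiny frame. This is a sketch on made-up values (only the column names are borrowed from the loan dataset), not output from the real data:

```python
import pandas as pd

# Small hypothetical frame with one text column and two numeric columns.
df_demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "LoanAmount": [120.0, 66.0, 100.0, 150.0],
})

print(df_demo.head(2))                  # first 2 rows
print(df_demo.describe())               # summary of numeric columns
print(df_demo.describe(include="all"))  # summary of every column

# numeric_only=True skips the text column when aggregating.
means = df_demo.mean(numeric_only=True)
medians = df_demo.median(numeric_only=True)
print(means)
print(medians)
```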


### See the first 5 entries

<li>data.head()

### See the last 5 entries

<li> data.tail()

### What is the number of observations & features in the dataset? 

<li> data.shape

#### Shape of Dataframe

df.shape gives both values as a tuple: (observations/rows, columns).

#### No. of observations(Rows)

df.shape[0] gives only the number of observations/rows.

#### No. of Features(Columns)

df.shape[1] gives only the number of features/columns.
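
A short sketch of reading rows and columns from the shape tuple, using a made-up three-row frame:

```python
import pandas as pd

# Hypothetical frame with 3 observations and 2 features.
df_small = pd.DataFrame({"Name": ["Tom", "Nick", "Juli"], "Age": [20, 21, 19]})

n_rows, n_cols = df_small.shape  # shape is a (rows, columns) tuple
print(df_small.shape)
print(n_rows)   # same as df_small.shape[0] or len(df_small)
print(n_cols)   # same as df_small.shape[1]
```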

###  Print the name of all the columns.

The loan dataset has 12 independent variables and 1 target variable, Loan_Status.

In [None]:
Image('Datacolumns.PNG')

###  What is the name of the 3rd column?

### How is the dataset indexed?

### Datatype of Features

<li><b>object: </b> The object dtype usually holds text, so these variables are treated as categorical. Categorical variables in our dataset are: Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status<br><br>
<li> <b>int64: </b> Represents integer variables. ApplicantIncome is of this type.<br><br>
<li> <b>float64: </b> Represents variables that contain decimal values. They are also numerical variables. The float64 variables in our dataset are: CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History<br>
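
To check these types yourself, a sketch along the following lines may help. It assumes a tiny made-up frame with loan-dataset column names; select_dtypes then splits the columns into numeric and non-numeric groups:

```python
import pandas as pd

# Hypothetical frame: one text column, one integer column, one float column.
df_types = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5000, 3000],
    "LoanAmount": [120.0, 66.0],
})

print(df_types.dtypes)  # data type of each column

# Split column names by kind of data type.
numeric_cols = df_types.select_dtypes(include="number").columns.tolist()
other_cols = df_types.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```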

###  Features information

### Describing Data

# LOW LEVEL DATA UNDERSTANDING

## Univariate Analysis

### Categorical Feature

#### Unique values in a column

#### No. of unique values in a column

In [None]:
# Set normalize=True in value_counts() to print proportions instead of counts


#### Bar chart

Loans were approved for 422 of the 614 applicants (around 69%).
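
The count and the proportion quoted above are the kind of numbers value_counts produces. The sketch below rebuilds a synthetic stand-in for the Loan_Status column from those counts; the "Y"/"N" labels are an assumption, and this is not the real data:

```python
import pandas as pd

# Synthetic stand-in: 422 approved and 192 not approved (614 in total).
loan_status = pd.Series(["Y"] * 422 + ["N"] * 192, name="Loan_Status")

counts = loan_status.value_counts()                # absolute counts
shares = loan_status.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares.round(2))
```

Calling .plot.bar() on either result draws the corresponding bar chart.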

Variables fall into three types: categorical, ordinal, and numerical.

**Categorical features:** These features have categories (Gender, Married, Self_Employed, Credit_History, Loan_Status)

**Ordinal features:** Categorical features whose values have some order (Dependents, Education, Property_Area)

**Numerical features:** These features have numerical values (ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term)

#### Pie chart

### Numeric Feature

#### Histogram

#### Density plot

#### Box Plot

Most of the applicant income values sit toward the left of the distribution, with a long tail to the right, so the variable is right-skewed rather than normally distributed. We will try to make it closer to normal in later sections, as many algorithms work better when the data is approximately normally distributed.

The boxplot confirms the presence of a lot of outliers/extreme values. This can be attributed to income disparity in society. Part of it may be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

## Bivariate Analysis

In [None]:
import pandas as pd


### 2 Categorical Variable

#### Cross table<br>

<li> pandas.crosstab(column #1, column #2, margins = True, normalize = 'index')<br><br>
  margins = True -----------> adds totals across rows and columns.<br>
  normalize = 'index' -----> gives proportions within each row.<br>
  normalize = 'columns' -----> gives proportions within each column.<br>
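
A minimal crosstab sketch on made-up data (the Gender and Loan_Status column names come from the loan dataset, the six rows and the "Y"/"N" labels are assumptions):

```python
import pandas as pd

# Hypothetical observations of two categorical variables.
df_cat = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N"],
})

# Counts with row and column totals in an extra "All" row/column.
counts = pd.crosstab(df_cat["Gender"], df_cat["Loan_Status"], margins=True)

# Proportions within each row (each row sums to 1).
row_share = pd.crosstab(df_cat["Gender"], df_cat["Loan_Status"], normalize="index")

print(counts)
print(row_share)
```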

 

In [None]:
df.head()

### 2 Numeric Variable

#### Correlation

<li> Correlation is positive when the values increase together, and <br>
<li> Correlation is negative when one value decreases as the other increases. <br>
The standard (Pearson) correlation coefficient measures only linear relationships (those that follow a line).

In [None]:
Image("corr.png")

Correlation takes a value between -1 and 1: <br><br>

<li> 1 is a perfect positive correlation <br>
<li> 0 is no correlation (the values don't seem linked at all) <br>
<li> -1 is a perfect negative correlation
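
These three cases can be reproduced with DataFrame.corr, which computes the Pearson correlation by default. The sketch below uses made-up columns; the "curve" column is a deliberately non-linear function of x to show that a correlation of 0 does not rule out a relationship:

```python
import pandas as pd

# Hypothetical numeric columns.
df_num = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],    # increases with x
    "down": [10, 8, 6, 4, 2],  # decreases as x increases
    "curve": [4, 1, 0, 1, 4],  # (x - 3) squared: related to x, but not linearly
})

corr = df_num.corr()  # pairwise Pearson correlation matrix
print(corr.round(2))
```

Here x and curve are clearly linked, yet their linear correlation is zero.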

In [None]:
import seaborn as sns


#### Correlation plot

In [None]:
Image("no_corr.png")

#### Solution : Predictive Power Score(PPS)  <br>

<li> It works on both Linear and Non-Linear Relationships <br>
<li> Can be applied to both Numeric and Categorical columns <br>
<li> It finds more patterns in the data.

In [None]:
!pip install ppscore
import ppscore as pps

pps.matrix(df)


### Categorical & Numeric Feature 

We can see that a higher number of graduates have very high incomes, and these values appear as outliers.
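
One way to compare a numeric feature across the groups of a categorical feature is groupby followed by describe. The sketch below uses made-up incomes, and the "Graduate" / "Not Graduate" labels are assumptions; it is not the real loan data:

```python
import pandas as pd

# Hypothetical incomes for two education groups.
df_mix = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "ApplicantIncome": [5000, 6000, 40000, 3000, 3500],
})

# Per-group summary statistics of the numeric column.
summary = df_mix.groupby("Education")["ApplicantIncome"].describe()
print(summary[["count", "mean", "max"]])
```

For the plot itself, df.boxplot(column="ApplicantIncome", by="Education") draws one box per group.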

# Pandas without Coding

## Pandas Profiling

In [None]:
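# Note: newer releases of this package are published as ydata-profiling
# (import ydata_profiling); the import below uses the older package name.
import pandas_profiling
pandas_profiling.ProfileReport(df)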

## Pandas GUI Demo

PandasGUI Demo : https://www.youtube.com/watch?v=NKXdolMxW2Y

## Bamboolib

Bamboolib Link : https://docs.bamboolib.8080labs.com/

https://hub.mybinder.turing.ac.uk/user/8080labs-bamboo-binder_template-g7hejlwr/notebooks/bamboolib_demo_titanic.ipynb