# Pandas

***

### Why Pandas

Pandas is one of the most powerful data manipulation tools available, and it becomes even more effective once you learn to use its indexing to your advantage.

### DataFrame Basics

*DataFrame* is the main object in Pandas. Pandas takes data (such as a CSV or JSON file, or a SQL database) and creates a Python object with **rows** and **columns**. A DataFrame represents tabular data, much like an Excel spreadsheet.
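
As a minimal sketch of the rows-and-columns idea, a DataFrame can also be built directly from a Python dictionary. The records and the name `sample_df` below are hypothetical and only loosely modeled on the loan data used later.

```python
import pandas as pd

# Hypothetical sample records; the values are made up for illustration.
records = {
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "ApplicantIncome": [5000, 3000, 4000],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
}

# Each dictionary key becomes a column; each list position becomes a row.
sample_df = pd.DataFrame(records)

print(sample_df.shape)          # (rows, columns)
print(list(sample_df.columns))  # column labels
```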



In [None]:
from IPython.display import Image  # Image is not defined without this import
Image("EDA.png")

***

## 6 Parts of Pandas
1. Importing Data and Reading Data
2. Summarizing Data (Statistics)  
3. Manipulating Data / Cleaning Data
4. Selecting Data / Subsetting Data
5. Grouping and Filtering Data
6. Combining Datasets

# Getting Started

## Import Libraries


**Pandas:** Used for data manipulation and data analysis.
<br>
**NumPy:** Fundamental package for scientific computing with Python.
<br>
**Matplotlib and Seaborn:** For plotting and visualization.
<br>
**Scikit-learn:** For data preprocessing techniques and algorithms.

In [None]:
# Importing required Packages
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline



## Importing Data


In [None]:
# Read Loan Dataset

df = pd.read_csv("train.csv")
df

In [None]:
df.dtypes

In [None]:
# Cast CoapplicantIncome from float to integer (this fails if the column contains missing values)
df['CoapplicantIncome'] = df['CoapplicantIncome'].astype('int')

In [None]:
df.dtypes
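
The integer cast above only works when the column has no missing values. The sketch below uses a hypothetical `income` Series with one missing entry (not taken from the dataset) to show what happens otherwise, along with two possible workarounds. Filling with 0 is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical income values with one missing entry.
income = pd.Series([1500.0, np.nan, 0.0], name="CoapplicantIncome")

# A plain integer cast cannot represent NaN, so pandas raises an error.
try:
    income.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: fill missing values first (0 is an arbitrary choice here).
filled = income.fillna(0).astype("int")

# Option 2: use the nullable integer dtype, which keeps missing values as <NA>.
nullable = income.astype("Int64")
```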

# High Level Data Understanding

### Functions

***
These functions are the most common tools for summarizing your data.

- **df.head(n)** - Returns the first n rows of your DataFrame. With no argument, it returns the first 5 rows
- **df.tail(n)** - Returns the last n rows of your DataFrame. With no argument, it returns the last 5 rows
- **df.shape** - Returns the number of rows and columns in your DataFrame as a tuple (an attribute, so no parentheses)
- **df.describe()** - Displays a statistical summary for numerical columns
- **df.describe(include=['object'])** - Displays a statistical summary for all object (string) columns
- **df.describe(include='all')** - Displays a statistical summary for all columns
- **df.mean(numeric_only=True)** - Returns the mean of each numeric column
- **df.median(numeric_only=True)** - Returns the median of each numeric column
- **df.std(numeric_only=True)** - Returns the standard deviation of each numeric column
- **df.max()** - Returns the highest value in each column
- **df.min()** - Returns the lowest value in each column
- **df.dtypes** - Returns the data type of each column
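
The functions listed above can be tried on a small frame before running them on the full dataset. This is a sketch on a hypothetical `toy` DataFrame with invented values; the column names only mirror the loan data.

```python
import pandas as pd

# Hypothetical miniature of the loan data; values are invented.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "LoanAmount": [120.0, 66.0, 100.0, 150.0],
})

print(toy.head(2))                   # first 2 rows
print(toy.shape)                     # attribute, not a method: (4, 3)
print(toy.describe())                # numeric columns only
print(toy.describe(include="all"))   # numeric and string columns together

# numeric_only=True skips the string column instead of raising an error.
means = toy.mean(numeric_only=True)
print(means)
```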


### See the first 10 entries

<li> df.head(10)

In [None]:
df.head(10)

### See the last 20 entries

<li> df.tail(20)

In [None]:
df.tail(20)

### What is the number of observations & features in the dataset? 

<li> df.shape

#### Shape of Dataframe

`df.shape` returns both values as a tuple: (observations/rows, features/columns).

In [None]:
df.shape

#### No. of Observations (Rows)

`df.shape[0]` returns only the number of observations (rows).

In [None]:
df.shape[0]

#### No. of Features (Columns)

`df.shape[1]` returns only the number of features (columns).

In [None]:
df.shape[1]

### Print the names of all the columns

In [None]:
df.columns

The loan dataset has 12 independent variables and 1 target variable, `Loan_Status`.

In [None]:
Image('Datacolumns.PNG')

### What is the name of the 3rd column?

In [None]:
df.columns[2]

### How is the dataset indexed?

In [None]:
df.index

### Datatype of Features

In [None]:
df.dtypes

<li><b>object: </b> The object dtype holds strings, so these variables are categorical. Categorical variables in our dataset are: Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status<br><br>
<li> <b>int64: </b> It represents integer variables. ApplicantIncome is of this format.<br><br>
<li> <b>float64: </b> It represents numerical variables that can hold decimal values. Variables read in this format are: CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History. Note that CoapplicantIncome becomes an integer column after the <code>astype('int')</code> cast above.<br>

### Feature information

In [None]:
df.info()

### Describing Data

In [None]:
df.describe(include="all")

# Low Level Data Understanding

## Univariate Analysis

### Categorical Feature

In [None]:
df.head()

In [None]:
df['Loan_Status'].value_counts(normalize=True)

In [None]:
df['Education'].unique()

In [None]:
df['Education'].value_counts()

In [None]:
df['Loan_Status'].value_counts()

### Bar Plot
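
A bar plot is a natural way to show the category counts computed with `value_counts()`. The sketch below uses a hypothetical `status` Series in place of `df['Loan_Status']`; the values are invented and this is only one way to draw the plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical target values standing in for df['Loan_Status'].
status = pd.Series(["Y", "N", "Y", "Y", "N", "Y"], name="Loan_Status")

counts = status.value_counts()   # frequency of each category

fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)   # one bar per category
ax.set_xlabel("Loan_Status")
ax.set_ylabel("Count")
```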

### Numeric Feature

In [None]:
df.head()

In [None]:
df['LoanAmount'].describe()

#### Histogram
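
A histogram shows how a numeric feature such as `LoanAmount` is distributed. This is a sketch on a hypothetical `loan_amount` Series with invented values, and the choice of 5 bins is arbitrary.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical loan amounts standing in for df['LoanAmount'].
loan_amount = pd.Series(
    [66, 100, 120, 128, 141, 150, 267, 95, 158, 168], name="LoanAmount"
)

fig, ax = plt.subplots()
# dropna() removes missing values, which a histogram cannot place in a bin.
bin_counts, bin_edges, _ = ax.hist(loan_amount.dropna(), bins=5)
ax.set_xlabel("LoanAmount")
ax.set_ylabel("Frequency")
```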

## Bivariate Analysis

### Joint Plot

In [None]:
# Joint Distribution Plot

# you can change parameters of joint plot
# kind : { “scatter” | “reg” | “resid” | “kde” | “hex” }



#### Assign color to datapoints according to a categorical variable: 


### Pair Plot

### Cat Plot

**kind parameter** <br>

<li> boxplot() (with kind="box") <br>
<li> violinplot() (with kind="violin") <br>
<li> swarmplot() (with kind="swarm")
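
The `kind` values above are passed to `sns.catplot`, which plots a numeric feature against a categorical one. The sketch below assumes `Education` and `LoanAmount` as the pair and uses a hypothetical `toy` frame with invented values.

```python
import pandas as pd
import seaborn as sns

# Hypothetical values; the column names mirror the loan dataset.
toy = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate",
                  "Graduate", "Not Graduate", "Graduate"],
    "LoanAmount": [128, 66, 120, 141, 95, 267],
})

# kind="box" draws one box per Education level; try "violin" or "swarm" too.
grid = sns.catplot(data=toy, x="Education", y="LoanAmount", kind="box")
```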



# Pandas without Coding

## Pandas Profiling

In [None]:
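# The pandas_profiling package was renamed ydata-profiling; on newer installs,
# the import may need to be `import ydata_profiling` instead.
import pandas_profiling
pandas_profiling.ProfileReport(df)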

## Pandas GUI Demo

PandasGUI Demo : https://www.youtube.com/watch?v=NKXdolMxW2Y

# Writing a Dataframe to csv/excel
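
A DataFrame can be saved with `to_csv`, and with `to_excel` when an Excel engine such as openpyxl is installed. The sketch below writes a hypothetical `toy` frame to a temporary directory; the file name `loans_out.csv` is an assumption chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for the loan data.
toy = pd.DataFrame({"Loan_ID": ["LP001", "LP002"], "LoanAmount": [128, 66]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "loans_out.csv")

# index=False leaves out the row index so it does not become an extra column.
toy.to_csv(csv_path, index=False)

# Reading the file back confirms what was written.
round_trip = pd.read_csv(csv_path)
print(round_trip)

# Excel output works the same way but needs an engine such as openpyxl:
# toy.to_excel(os.path.join(out_dir, "loans_out.xlsx"), index=False)
```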