# Pandas 

## What is it?
Pandas is a high-level programming library that supports analysis and exploration of structured, tabular data. It's great for any kind of data that would fit in a table.

## Why use pandas instead of SQL?
Mostly because it's easier to write. It's all Python, and the API is very "pythonic". It's also fast for data that is already in memory.
The main reason not to use pandas is when the data doesn't fit into memory (your RAM). If you have 100 GB of data, you need something like SQL. But if your dataset fits into RAM, pandas makes life really smooth and easy.
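
One way to judge whether a dataset fits in RAM is to ask pandas how many bytes a DataFrame occupies. This is a minimal sketch on a made-up table; the name `toy` and its columns are illustrative and not part of the housing data.

```python
import pandas as pd

# Hypothetical toy table; substitute your own DataFrame.
toy = pd.DataFrame({
    "city": ["Oakland", "Fresno", "Chico"],
    "population": [440_000, 542_000, 101_000],
})

# Bytes used by the index and by each column.
# deep=True also counts the Python string objects themselves.
per_column = toy.memory_usage(deep=True)
total_bytes = int(per_column.sum())

print(per_column)
print(f"total: {total_bytes} bytes")
```

If the total for your real data approaches the RAM you have available, that is a hint to look at a database or a chunked workflow instead.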

## Is Pandas fast?
Yes, for data that fits in memory.
Pandas builds on NumPy for its underlying data structures, and NumPy is a library that is heavily optimized for speed.
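
To make the NumPy connection concrete, here is a small sketch (the Series `s` is made up for illustration). It shows that a Series hands back a NumPy array, and that a vectorized expression gives the same answer as an explicit Python loop. On large data the vectorized form is typically much faster because the loop runs in compiled code, but no timings are claimed here.

```python
import numpy as np
import pandas as pd

# Illustrative data: the floats 0.0, 1.0, ..., 99999.0
s = pd.Series(np.arange(100_000, dtype="float64"))

# The values of a Series are available as a NumPy array.
arr = s.to_numpy()
print(type(arr), arr.dtype)

# Vectorized: a single expression over the whole column.
vectorized = (s * 2).sum()

# Equivalent element-by-element Python loop.
looped = sum(x * 2 for x in arr)

print(vectorized == looped)
```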

## What can I do in pandas?
Let's find out!

In [None]:
import numpy as np
import pandas as pd


In [None]:
df = pd.read_csv('cal_housing.csv')

In [None]:
df.head()

## `df.head()` 
## `df.shape`
## `df.info()`
## `df.describe()`
These are the four commands I'd run pretty much any time I load up some new data.


In [None]:
df.head()

In [None]:
df.shape

# (rows, columns)

In [None]:
df.info()

# Note: Pandas has Dtypes for each column
This is very important! Because every column has a single dtype (stored in NumPy arrays underneath), operations on a column can be optimized to run faster.
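
A minimal sketch of per-column dtypes, using a made-up table (the names `toy`, `rooms`, `income` and `label` are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "rooms": [3, 4, 5],           # integers
    "income": [2.5, 3.1, 4.8],    # floats
    "label": ["a", "b", "c"],     # strings
})

# One dtype per column.
print(toy.dtypes)

# astype converts a column to another dtype.
toy["rooms"] = toy["rooms"].astype("float64")
print(toy.dtypes)
```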

In [None]:
df.describe()

In [None]:
df['MedInc'].plot(kind='hist', figsize=(12,5), bins=50)

In [None]:
# Assigning a single value creates a new column with that value in every row
df['string'] = 'hello'
df

In [None]:
df.hist(figsize=(20,10));

In [None]:
df['AveBedrms'].plot(kind='hist', bins = 200, figsize=(15,5), ylim=(0,20));

# Parts of a DataFrame

 * `.values` - a NumPy array that stores all the values
 * every column is a `pd.Series`
 * `.columns` holds the column headers
 * `.index` holds the row labels
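
The parts listed above can be inspected on a small made-up table (the name `toy` and its labels are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

values = toy.values   # 2-D NumPy array holding all the values
column = toy["A"]     # a single column is a pd.Series

print(type(values), values.shape)
print(type(column))
print(list(toy.columns))  # column headers
print(list(toy.index))    # row labels
```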

In [None]:
df.values

In [None]:
df.columns

In [None]:
df.index

# Indexing and Slicing
## `[]` syntax works out of the box

In [None]:
# df['colname']
# df.colname

In [None]:
df['MedInc']

In [None]:
df.MedInc

In [None]:
df[0:10]

In [None]:
df[['MedInc', 'HouseAge', 'Population']]

In [None]:
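# This raises an error: plain [] cannot take a row slice and a column list at the same time.
# Use .loc or .iloc for that.
df[0:10,['MedInc', 'HouseAge', 'Population']]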

# `.loc` and `.iloc`

## `.loc` is for _label_-based indexing. You need to use keys / labels
## `.iloc` is for _integer_ position-based indexing. You need to use numbers

In [None]:
df.head()

In [None]:
# .copy() makes it explicit that changing b later does not touch df
b = df.loc[0:5, ['MedInc', 'Population']].copy()
b

In [None]:
df.iloc[0:4, 0:4]

## In `.loc[0:5]`, the `0:5` refers to labels of the _index_, so the end label 5 is included.
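
A short sketch of the difference, on made-up tables (`toy` and `shifted` are illustrative names): a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as ordinary Python slicing does.

```python
import pandas as pd

toy = pd.DataFrame({"A": range(10)}, index=range(10))

by_label = toy.loc[0:5]      # labels 0 through 5, end label included
by_position = toy.iloc[0:5]  # positions 0 through 4, end excluded
print(len(by_label), len(by_position))

# With an index that does not start at 0, labels and positions differ.
shifted = pd.DataFrame({"A": range(10)}, index=range(100, 110))
label_slice = shifted.loc[100:105]   # six rows, labels 100..105
position_slice = shifted.iloc[0:5]   # five rows, positions 0..4
print(len(label_slice), len(position_slice))
```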

In [None]:
df.index

In [None]:
b += 2
b

In [None]:
df.loc[0:5, ['MedInc', 'HouseAge', 'Population']]

In [None]:
dates = pd.date_range('1/1/2000', periods=8)

df1 = pd.DataFrame(np.random.randn(8, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])

df1

In [None]:
df1.loc[0:5, ['A','C']] # this raises an error: .loc is label-based and the index holds datetimes, not ints

In [None]:
df1.loc[dates[0:5], ['A','C']]

In [None]:
df.head()

In [None]:
df1.loc[dates[:5], 'A']

In [None]:
df1.iloc[0:5, 0:2]

In [None]:
df.iloc[0:5, [0,3]]

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df['MedInc'] > 4.75

# Boolean Masking / Indexing
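
A boolean mask is a Series of `True` / `False` values; indexing with it keeps only the rows where the mask is `True`. The sketch below uses a made-up table (`toy` is an illustrative name). Conditions are combined with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 5, 9], "B": [10, 20, 30]})

mask = toy["A"] > 3        # boolean Series: False, True, True
kept = toy[mask]           # rows where the mask is True

both = toy[(toy["A"] > 3) & (toy["B"] < 30)]    # both conditions hold
either = toy[(toy["A"] > 6) | (toy["B"] < 15)]  # at least one holds
negated = toy[~mask]                            # rows where the mask is False

print(kept, both, either, negated, sep="\n\n")
```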

In [None]:
b = pd.DataFrame(np.arange(0,9).reshape(3,3), columns = ["A", "B", "C"])
c = pd.DataFrame(np.arange(10,19).reshape(3,3), columns = ["A", "B", "C"])

In [None]:
b

In [None]:
b[b>3]

In [None]:
b['C'] > 2

In [None]:
b[b['C'] > 2]

In [None]:
c

In [None]:
c['B'] > 14

In [None]:
b[c['B'] > 14]

In [None]:
b[c>15]

In [None]:
df[df['MedInc'] > 4.8]

In [None]:
df.loc[df['MedInc'] > 4.8, ['AveRooms']]

In [None]:
df[(df['MedInc'] > 4.75) & (df['AveRooms'] > 10)]

# Checking for Nulls
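
`isna()` returns a table of booleans with the same shape as the data, and summing it counts the missing values per column (`True` counts as 1). A minimal sketch on a made-up table with some gaps (`toy` is an illustrative name):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
})

missing_per_column = toy.isna().sum()            # count of NaN in each column
total_missing = int(missing_per_column.sum())    # count over the whole table

print(missing_per_column)
print(total_missing)
```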

In [None]:
df

In [None]:
df.isna()

In [None]:
df.isna().sum()

# Descriptive Stats

In [None]:
df.describe()

In [None]:
sliced_in = df[(df['MedInc'] > 4.75) & (df['AveRooms'] > 10)]

In [None]:
sliced_in.head()

In [None]:
sliced_in['HouseAge'].mean()

In [None]:
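# numeric_only=True skips the 'string' column added earlier (recent pandas raises an error otherwise)
sliced_in.mean(numeric_only=True)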

# axis!
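
The `axis` argument picks the direction an operation collapses. A small sketch on a made-up table (`toy` is an illustrative name): `axis=0` (or `'rows'`) gives one result per column, and `axis=1` (or `'columns'`) gives one result per row.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

col_means = toy.mean(axis=0)   # down the rows: one mean per column
row_means = toy.mean(axis=1)   # across the columns: one mean per row

print(col_means)
print(row_means)

# The string aliases give the same results.
same_cols = toy.mean(axis="rows").equals(col_means)
same_rows = toy.mean(axis="columns").equals(row_means)
print(same_cols, same_rows)
```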

In [None]:
sliced_in.mean(axis=0, numeric_only=True)

In [None]:
sliced_in.mean(axis=1, numeric_only=True)

In [None]:
sliced_in.mean(axis='columns', numeric_only=True)

In [None]:
sliced_in.mean(axis='rows', numeric_only=True)

# Plotting

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df.describe()

In [None]:
df['MedInc'].plot(kind = 'hist', figsize = (12,5), bins = 20)
plt.vlines(df['MedInc'].mean(), ymin = 0, ymax = 4000, color = 'r')

In [None]:
df.hist(figsize=(20, 10), bins=30, edgecolor="black")
# plt.subplots_adjust(hspace=0.7, wspace=0.4)

In [None]:
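# `housing` was not defined earlier in this notebook.
# Assumption: cal_housing.csv holds the scikit-learn California housing
# features in their original row order, so the targets line up with df.
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
df['price'] = housing.target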

In [None]:
df.plot(x='Latitude', y='Longitude', kind='scatter', figsize=(10,10), colorbar=True, cmap='viridis', c=df['price'])

In [None]:
# numeric_only=True skips the 'string' column added earlier
corrs = df.corr(numeric_only=True)
corrs

In [None]:
plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(corrs, dtype=bool))  # np.bool was removed from NumPy; use the built-in bool
heatmap = sns.heatmap(corrs, mask=mask, annot=True, cmap='rocket_r')
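
To see what the upper-triangle mask looks like, here is a sketch on a 3x3 case (the names `full_mask` and `above_diagonal` are illustrative). Cells where the mask is `True` are the ones the heatmap hides. `np.triu` includes the diagonal by default; passing `k=1` would keep the diagonal visible.

```python
import numpy as np

# True on and above the diagonal, False below it.
full_mask = np.triu(np.ones((3, 3), dtype=bool))
print(full_mask)

# k=1 starts one step above the diagonal, so the diagonal stays False.
above_diagonal = np.triu(np.ones((3, 3), dtype=bool), k=1)
print(above_diagonal)
```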

In [None]:
sns.pairplot(df, corner = True)