# Pandas: Data Manipulation and Analysis in Python

In [None]:
import pandas as pd
import numpy as np

### Pandas
* Open-source Python library for working with structured (tabular) data
* One of the most widely used libraries for the data cleaning step
* Optimized for fast in-memory performance on "large data" that still fits on one machine (not big data)

Integrates naturally with Python: the syntax is readable and follows the conventions of the rest of the language

### Getting the Data

* Pandas reads Excel, CSV, SQL tables, JSON, and fixed-width files
* read_excel, read_csv, etc. all return the data as a DataFrame

In [None]:
df = pd.read_excel("Data.xlsx")
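
The other readers work the same way as read_excel. As a minimal sketch, the cell below parses a small CSV held in memory; the contents of `csv_text` are made up for illustration and only borrow column names from this notebook. The blank `omits` entry is read as a missing value, which is why that column comes back as floating point.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """form,type,correct,incorrect,omits
5MSA11,MC,85,10,5
K-50SA10,MC,70,25,
"""

# read_csv accepts a path or any file-like object and returns a DataFrame
sample = pd.read_csv(io.StringIO(csv_text))
print(type(sample))
print(sample.dtypes)  # 'omits' is float64 because it contains a missing value
```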

### DataFrames
* Main data structure in Pandas
* A DataFrame is a collection of one-dimensional columns called 'Series' that share a row index
* Similar to a SQL table or a data.frame in R
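
To make the DataFrame/Series relationship concrete, here is a small sketch on a made-up two-row table (`demo` and its values are hypothetical). Selecting a single column returns a Series that keeps the column name and the row index of its parent.

```python
import pandas as pd

# Hypothetical two-column table used only for illustration
demo = pd.DataFrame({'form': ['5MSA11', 'K-50SA10'],
                     'correct': [85, 70]})

col = demo['correct']                   # selecting one column yields a Series
print(type(col).__name__)               # Series
print(col.name)                         # correct
print(col.index.equals(demo.index))     # True: the Series shares the row index
```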

In [None]:
df.head()

In [None]:
# Get information about the DataFrame
print(df.shape)
df.info()  # info() prints its summary directly and returns None, so no print() is needed
# Pandas infers both the schema and the data types
# Note that the 'omits' column has a floating-point type

#### Data Exploration

In [None]:
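print(df.describe())
# numeric_only=True skips text columns such as 'form', which would otherwise
# raise a TypeError in recent pandas versions
print(df.mean(numeric_only=True))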

In [None]:
print(df['form'].unique())
print(df['form'].value_counts())
# Access a column with df['column'] or, when the name is a valid Python identifier, df.column

DataFrames are mutable: they can be modified after creation
* You can add, delete, or alter rows and columns

In [None]:
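# rename returns a new DataFrame and leaves df unchanged;
# assign the result back to df to keep the new column name
df.rename(columns={'section_NUM':'section'}).head(3)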
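
The sketch below walks through adding, altering, and deleting columns on a small made-up table (`demo` and its values are hypothetical; the column names mirror this notebook). Column assignment changes the DataFrame in place, while `drop` and `rename` return new objects that have to be assigned back.

```python
import pandas as pd

# Hypothetical data; column names mirror the ones used in this notebook
demo = pd.DataFrame({'section_NUM': [1, 2],
                     'correct': [85, 70],
                     'Total': [100, 100]})

demo['incorrect'] = demo['Total'] - demo['correct']     # add a column (in place)
demo['correct'] = demo['correct'] + 1                   # alter a column (in place)
demo = demo.drop(columns=['Total'])                     # delete: returns a new DataFrame
demo = demo.rename(columns={'section_NUM': 'section'})  # rename: returns a new DataFrame
print(demo)
```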

### Querying/Subsetting a DataFrame
* Similar to SQL (SELECT, FROM, WHERE)

#### Selecting columns

In [None]:
subset = df[['correct','incorrect']]
print(subset)

#### Selecting rows using logic

In [None]:
subset1 = subset[subset['correct']>=80]
print(subset1)
#print(subset1.reset_index(drop=True))

In [None]:
print(df[(df['type'] == 'MC') & (df['omits']<=10)].reset_index(drop=True))

#### Aggregate queries with 'groupby'

In [None]:
df[['form','correct', 'incorrect']].groupby('form').mean()
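
To tie the pieces back to the SQL analogy, the sketch below combines a row filter, a groupby, and named aggregations in one expression. The `demo` table, the 'FR' type value, and the output column names `mean_correct` and `n` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({'form': ['5MSA11', '5MSA11', 'K-50SA10'],
                     'type': ['MC', 'MC', 'FR'],
                     'correct': [80, 90, 60]})

# Roughly equivalent SQL:
#   SELECT form, AVG(correct) AS mean_correct, COUNT(*) AS n
#   FROM demo WHERE type = 'MC' GROUP BY form
summary = (demo[demo['type'] == 'MC']                      # WHERE
           .groupby('form')                                # GROUP BY
           .agg(mean_correct=('correct', 'mean'),          # AVG(correct)
                n=('correct', 'size'))                     # COUNT(*)
           .reset_index())
print(summary)
```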

### Joins

In [None]:
left = pd.DataFrame({'form': ['5MSA11', 'K-50SA10'],
                     'A': [10, 25],
                     'B': [20, 25]})
right = pd.DataFrame({'form': ['K-50SA10', '5MSA11'],
                      'C': [25, 30],
                      'D': [25, 40]})

print(left)
print(right)

result = pd.merge(left, right, on='form')
print(result)
# defaults to an inner join
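
The `how` argument of pd.merge selects the join type. The sketch below uses made-up tables in which one form code appears on each side only ('X-1' and 'Y-2' are hypothetical) to show how inner, left, and outer joins differ; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables: 'X-1' exists only on the left, 'Y-2' only on the right
left_demo = pd.DataFrame({'form': ['5MSA11', 'K-50SA10', 'X-1'],
                          'A': [10, 25, 5]})
right_demo = pd.DataFrame({'form': ['K-50SA10', '5MSA11', 'Y-2'],
                           'C': [25, 30, 7]})

inner = pd.merge(left_demo, right_demo, on='form', how='inner')      # matching keys only
left_join = pd.merge(left_demo, right_demo, on='form', how='left')   # all left rows, NaN where unmatched
outer = pd.merge(left_demo, right_demo, on='form', how='outer',
                 indicator=True)                                      # all rows from both sides

print(len(inner), len(left_join), len(outer))
print(outer)
```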

### Pandas for Data Cleaning
* DataFrames usually aren't clean (incorrect format, missing values)
* Mutability and the ability to change entire rows or columns at once make Pandas effective for cleaning
* We don't just want the data to be clean; we want it to be useful
* Feature engineering for machine learning
* A commonly cited estimate is that 80-90% of the time in an analytics/ML project goes to data cleaning

### Missing Values

In [None]:
df.isnull().sum()

In [None]:
df
# df['type'] = df['type'].fillna('MC')
# df['omits'] = df['omits'].fillna(0)
# df

Other strategies for handling missing values include:
* Filling with the column average
* Interpolation
* Deleting the row
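
Each of these strategies is a single call in Pandas. The sketch below applies them to a made-up `omits` column with one missing entry (the values are hypothetical); which strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
demo = pd.DataFrame({'omits': [2.0, np.nan, 6.0, 8.0]})

mean_filled = demo['omits'].fillna(demo['omits'].mean())  # fill with the column average
interpolated = demo['omits'].interpolate()                # linear interpolation between neighbors
dropped = demo.dropna(subset=['omits'])                   # delete rows where 'omits' is missing

print(mean_filled.tolist())
print(interpolated.tolist())
print(len(dropped))
```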

### Scaling

In [None]:
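# Log transform to compress large values.
# Caution: np.log returns -inf for 0, and re-running this cell applies the log again
# because the column is overwritten.
df['incorrect'] = np.log(df['incorrect'])
df.head()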
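
As a hedged variation on the log transform, the sketch below uses `np.log1p`, which computes log(1 + x) and is therefore defined at zero, and also shows min-max scaling to the range [0, 1]. Both are swapped-in alternatives rather than the method used in this notebook; the `demo` values and the new column names are hypothetical. Writing to new columns keeps the original values intact.

```python
import numpy as np
import pandas as pd

# Hypothetical counts, including a zero
demo = pd.DataFrame({'incorrect': [0, 9, 99]})

# log(1 + x): defined at x = 0, unlike np.log
demo['incorrect_log'] = np.log1p(demo['incorrect'])

# Min-max scaling: map the smallest value to 0 and the largest to 1
col = demo['incorrect']
demo['incorrect_minmax'] = (col - col.min()) / (col.max() - col.min())

print(demo)
```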

### Feature Engineering
* Create new features/variables based on existing variables
* Feature Engineering is a crucial part of training accurate algorithms
* Requires industry knowledge to be able to infer which statistics will be most useful for analytics

In [None]:
df['accuracy'] = df['correct']/df['Total']
df.head()

### One-Hot Encoding
* Encodes a categorical variable as one indicator (0/1) column per category
* Useful for math-based algorithms like neural networks, which require numerical values

In [None]:
print(df.head())
dummies = pd.get_dummies(df['type'])
dummies
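
get_dummies returns a separate DataFrame, so the encoded columns still have to be attached to the original table. One way to do that, sketched on made-up data (`demo`, the 'FR' type value, and the `type_` prefix are assumptions for illustration), is to drop the original column and concatenate the indicators.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({'form': ['5MSA11', 'K-50SA10', '5MSA11'],
                     'type': ['MC', 'FR', 'MC']})

# One indicator column per category, named type_<category>
demo_dummies = pd.get_dummies(demo['type'], prefix='type')

# Replace the original categorical column with its indicator columns
encoded = pd.concat([demo.drop(columns=['type']), demo_dummies], axis=1)
print(encoded)
```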

In [None]:
df

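### For Big Data, Pandas' efficiency will drop
* Pandas holds the entire DataFrame in the memory of a single machine
* Once the data no longer fits, use big data tools like Hadoop or Spark instead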