![data-x-logo.png](attachment:data-x-logo.png)

---
# Pandas Introduction 

**Author list:** Ikhlaq Sidhu & Alexander Fred Ojala

**References / Sources:** 
Includes examples from Wes McKinney and the 10 min intro to Pandas


**License Agreement:** Feel free to do whatever you want with this code

___

### Topics:
1. DataFrame creation
2. Reading data into DataFrames
3. Data manipulation

## Import package

In [None]:
# pandas
import pandas as pd

In [None]:
# Extra packages
import numpy as np
import matplotlib.pyplot as plt # for plotting

# jupyter notebook magic to display plots in output
%matplotlib inline

plt.rcParams['figure.figsize'] = (10,6) # make the plots bigger

# Part 1: Creation of Pandas DataFrames

**Key Points:** Main data types in Pandas:
* Series (similar to a 1D numpy array, but with an index)
* DataFrames (a table or spreadsheet in which every column is a Series)




### We use `pd.DataFrame()` and can insert almost any data type as an argument

**Function:** `pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)`

Input data can be a numpy ndarray (structured or homogeneous), dictionary, or DataFrame. 
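As a minimal sketch of the three input types listed above (the variable names and sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

# 1) From a homogeneous 2D ndarray: index and columns default to 0..n-1
from_array = pd.DataFrame(np.arange(6).reshape(3, 2))

# 2) From a dictionary: the keys become the column labels
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# 3) From another DataFrame: copy=True requests an independent copy
from_df = pd.DataFrame(from_dict, copy=True)

print(from_array.shape)
print(list(from_dict.columns))
print(from_df.equals(from_dict))
```

The `index` and `columns` arguments from the signature above can be passed in any of the three cases to override the default labels.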


### 1.1 Create a DataFrame using an array

In [None]:
# Try it with an array
np.random.seed(0) # set seed for reproducibility

a1 = np.random.randn(3)
a2 = np.random.randn(3)
a3 = np.random.randn(3)

print(a1)
print(a2)
print(a3)

In [None]:
# Create our first DataFrame w/ an np.array - it becomes a column


In [None]:
# Check type


In [None]:
# DataFrame from list of np.arrays

# notice that there is no column label, only integer values,
# and the index is set automatically

In [None]:
# We can set column and index names


In [None]:
# Add more columns to the DataFrame, as with a dictionary; dimensions must match


In [None]:
# DataFrame from 2D np.array
np.random.seed(0)
array_2d = np.random.randn(9).reshape(3, 3)


In [None]:
# Create df with labeled columns

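One possible way to fill in the cells of section 1.1 is sketched below. The labels `c1`, `c2`, `c3`, `r1`, `r2`, `r3` and the new column name `a4` are hypothetical choices, not names taken from the original notebook.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # set seed for reproducibility
a1 = np.random.randn(3)
a2 = np.random.randn(3)
a3 = np.random.randn(3)

# A single 1D array becomes one column
df0 = pd.DataFrame(a1)
print(type(df0))

# A list of arrays: each array becomes one row,
# and both index and columns get integer labels
df1 = pd.DataFrame([a1, a2, a3])

# Column and index names can be set explicitly (hypothetical labels)
df2 = pd.DataFrame([a1, a2, a3],
                   columns=["c1", "c2", "c3"],
                   index=["r1", "r2", "r3"])

# Add a column the way you would add a dictionary entry;
# its length must match the number of rows
df2["a4"] = a2

# DataFrame from a 2D array with labeled columns
np.random.seed(0)
array_2d = np.random.randn(9).reshape(3, 3)
df3 = pd.DataFrame(array_2d, columns=["c1", "c2", "c3"])
print(df3)
```

Note that a list of 1D arrays is stacked row by row, whereas the dictionary form in the next section places each array in its own column.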

### 1.2 Create a DataFrame using a dictionary

In [None]:
# DataFrame from a Dictionary
dict1 = {'a1': a1, 'a2':a2, 'a3':a3}


In [None]:
# Note that we now have columns without assignment



In [None]:
# We can add a list with strings and ints as a column 

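A minimal sketch of section 1.2, assuming the same seeded arrays as before. The column name `L` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
a1 = np.random.randn(3)
a2 = np.random.randn(3)
a3 = np.random.randn(3)

# DataFrame from a dictionary: keys become column labels,
# so no explicit `columns` argument is needed
dict1 = {"a1": a1, "a2": a2, "a3": a3}
df_dict = pd.DataFrame(dict1)
print(list(df_dict.columns))

# A list mixing strings and ints can be added as a column;
# pandas stores it with the generic `object` dtype
df_dict["L"] = ["List", "of", 3]
print(df_dict.dtypes)
```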

### Pandas Series object
Every column is a Series. A Series is like an np.array, but it can hold mixed data types (stored with dtype `object`) and it has its own index.

In [None]:
# Check type


In [None]:
# Dtype object


In [None]:
# Create a Series from a Python list, automatic index


In [None]:
# Specific index


In [None]:
# We can add the Series s to the DataFrame above as column Series
# Remember to match indices


In [None]:
# We can also rename columns


In [None]:
# We can delete columns


In [None]:
# or drop columns, see axis = 1
# does not change df1 if we don't set inplace=True


In [None]:
# Print df1

In [None]:
# Or drop rows

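The Series operations listed in the cells above could look roughly like this. The small frame and all labels are hypothetical stand-ins for the `df1` built earlier.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built earlier
df1 = pd.DataFrame({"a1": [1.0, 2.0, 3.0], "L": ["x", "y", 3]})

print(type(df1["a1"]))   # every column is a Series
print(df1["L"].dtype)    # mixed strings and ints -> object

# Series from a Python list, automatic index 0, 1, 2
s_auto = pd.Series([10, 20, 30])

# Series with a specific index
s = pd.Series([10, 20, 30], index=[2, 1, 0])

# Adding a Series as a column aligns on index labels, not on position
df1["s"] = s

# Rename a column (returns a new DataFrame)
df1 = df1.rename(columns={"L": "mixed"})

# Delete a column in place
del df1["mixed"]

# drop returns a new DataFrame; df1 itself is unchanged
without_s = df1.drop("s", axis=1)   # axis=1 -> drop a column
without_row0 = df1.drop(0)          # default axis=0 -> drop the row labeled 0

print(df1)
```

Because assignment aligns on the index, the value `30` (label `0` in `s`) ends up in the first row of `df1`.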

### 1.3 Indexing / Slicing a Pandas DataFrame

In [1]:
# Example: view only one column


In [None]:
# Or view several columns


In [None]:
# A slice of the DataFrame is returned
# this takes the first three rows, then the first 2 rows of that sliced frame


In [None]:
# Let's print the first 2 elements of column a1
# This is a new Series (like a new table)


In [None]:
# Let's print 2 columns and their top 2 values; note the list of columns

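A sketch of the bracket-based selections described in section 1.3, using a small made-up frame:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"a1": [1, 2, 3, 4],
                   "a2": [5, 6, 7, 8],
                   "a3": [9, 10, 11, 12]})

one_col = df["a1"]             # a single label returns a Series
two_cols = df[["a1", "a2"]]    # a list of labels returns a DataFrame

first_three = df[0:3]          # a slice inside [] selects rows
chained = df[0:3][0:2]         # first three rows, then the first two of those

top2_a1 = df["a1"][0:2]               # first 2 elements of column a1
top2_two = df[["a1", "a2"]][0:2]      # 2 columns, top 2 rows

print(top2_two)
```

Chained brackets like these are fine for viewing data; for assignment, a single `loc` or `iloc` call is generally the safer choice.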

### Instead of double indexing, we can use loc, iloc

#### loc gets rows (or columns) with particular labels from the index.
#### iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
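The difference is easiest to see when the index labels are not in positional order. A minimal sketch with made-up data:

```python
import pandas as pd

# The index labels are deliberately shuffled: 2, 0, 1
df = pd.DataFrame({"a1": [10, 20, 30]}, index=[2, 0, 1])

by_label = df.loc[0, "a1"]     # row whose label is 0 -> 20
by_position = df.iloc[0, 0]    # first row, first column -> 10

print(by_label, by_position)
```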

### .iloc[]

In [None]:
# iloc


In [None]:
# Slice


In [None]:
# iloc will also accept 2 'lists' of position numbers


In [None]:
# Data only from row with index value '1'

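The `iloc` cells above might be filled in along these lines (the frame is hypothetical sample data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: values 0..11 in a 4x3 table
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a1", "a2", "a3"])

single = df.iloc[0, 0]              # one value: first row, first column
block = df.iloc[0:2, 0:2]           # slice: first 2 rows, first 2 columns
picked = df.iloc[[0, 2], [0, 2]]    # two lists of positions: rows 0, 2 and columns 0, 2
row1 = df.iloc[1]                   # the row at position 1, returned as a Series

print(picked)
```

With the default integer index used here, position 1 and label 1 refer to the same row; with any other index they can differ.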

### .loc[]

In [None]:
# Usually we want to grab values by column names 
# Note: You have to know indices and columns


In [None]:
# Boolean indexing
# Return full rows where a2 > 0


In [None]:
# Return column a3 values where a2 >0


In [None]:
# If you want the values in an np array

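A sketch of label-based and boolean selection with `loc`, assuming a small made-up frame with string index labels:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"a1": [1.0, 2.0, 3.0],
                   "a2": [-1.0, 0.5, 2.0],
                   "a3": [7.0, 8.0, 9.0]},
                  index=["a", "b", "c"])

# Grab values by row label and column name
value = df.loc["a", "a1"]
sub = df.loc["a":"b", ["a1", "a3"]]   # label slices include the end label

# Boolean indexing: full rows where a2 > 0
pos_rows = df[df["a2"] > 0]

# Column a3 values where a2 > 0
a3_where = df.loc[df["a2"] > 0, "a3"]

# The same values as a numpy array
as_array = a3_where.to_numpy()

print(as_array)
```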

### More Basic Statistics

In [None]:
# Get basic statistics using .describe()

In [None]:
# Get specific statistics


In [None]:
# We can change the index sorting

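The statistics cells above could be sketched as follows, again on made-up sample data:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"a1": [1.0, 2.0, 3.0], "a2": [4.0, 6.0, 8.0]})

# Basic statistics (count, mean, std, min, quartiles, max) per numeric column
summary = df.describe()

# Specific statistics
col_means = df.mean()       # mean of every column
a2_max = df["a2"].max()     # max of one column

# Change the index sorting (returns a new DataFrame)
desc_index = df.sort_index(ascending=False)

print(summary)
```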
#### For more functionality, see this notebook
https://github.com/ikhlaqsidhu/data-x/blob/master/02b-tools-pandas_intro-mplib_afo/legacy/10-minutes-to-pandas-w-data-x.ipynb



# Part 2: Reading data into a pandas DataFrame


### Now, let's get some data in CSV format.

#### Description:
Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/UCBAdmissions.html
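Since the notebook does not state the file path, the sketch below reads a small in-memory CSV instead. The column names `Admit`, `Gender`, `Dept`, `Freq` are an assumption about the file layout, and the counts are made up for illustration; they are not the dataset's values.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: assumed columns, made-up counts
csv_text = """Admit,Gender,Dept,Freq
Admitted,Male,A,10
Rejected,Male,A,5
Admitted,Female,A,8
Rejected,Female,A,2
Admitted,Male,B,6
Rejected,Male,B,4
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))   # column labels
print(df.head(3))         # first rows
print(df.tail(2))         # last rows
df.info()                 # dtypes and non-null counts
```

For the real data, replacing `io.StringIO(csv_text)` with the path to the downloaded CSV file should be all that changes.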

In [None]:
# Read in the file


In [None]:
# Check statistics

In [None]:
# Columns


In [None]:
# Head


In [None]:
# Tail


In [None]:
# Groupby 


In [None]:
# Describe


In [None]:
# Info


In [None]:
# Unique


In [None]:
# Total number of applicants to Dept A

In [None]:
# Groupby


In [None]:
# Plot using a bar graph

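The aggregation steps listed above could be sketched like this. The frame is a hand-typed stand-in with assumed column names (`Admit`, `Gender`, `Dept`, `Freq`) and made-up counts, not the real admissions data.

```python
import pandas as pd

# Hypothetical stand-in for the admissions table (made-up counts)
df = pd.DataFrame({
    "Admit":  ["Admitted", "Rejected", "Admitted", "Rejected", "Admitted", "Rejected"],
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Male"],
    "Dept":   ["A", "A", "A", "A", "B", "B"],
    "Freq":   [10, 5, 8, 2, 6, 4],
})

# Unique department labels
depts = df["Dept"].unique()

# Total number of applicants to Dept A
total_a = df.loc[df["Dept"] == "A", "Freq"].sum()

# Groupby: applicants per department, and per department and outcome
by_dept = df.groupby("Dept")["Freq"].sum()
by_dept_admit = df.groupby(["Dept", "Admit"])["Freq"].sum()

print(by_dept)
print(by_dept_admit)
```

With matplotlib available, calling `by_dept.plot(kind="bar")` on the grouped result is one way to produce the bar graph.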