pandas provides high-level data structures with convenient functions for reading and parsing data. It is built on top of NumPy and can read and write multiple file formats.

To install pandas:

!pip install --user pandas

In [1]:
import pandas as pd
import numpy as np

Two Data Types

* Series - a one-dimensional labeled array that can hold data of any type.
* DataFrame - a two-dimensional labeled data structure whose columns can hold data of any type. However, performance decreases if you use data types that aren’t handled well by C, such as arbitrary Python objects stored with the object dtype.
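
As a minimal sketch of the two types, the following builds a Series with explicit labels and a DataFrame from a dictionary of columns. The names `scores` and `students` and all values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels (hypothetical data)
scores = pd.Series([72, 69, 90], index=["a", "b", "c"])

# A DataFrame: two-dimensional, and each column can have its own dtype
students = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "math score": [72, 69, 47],
})

print(scores["b"])             # look up a value by its label
print(students["math score"])  # selecting one column returns a Series
```

Selecting a single column of a DataFrame gives back a Series, which is one way to see how the two types relate.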

First, let's make a pandas Series from a NumPy array.

In [2]:
arr = np.random.rand(3)

In [3]:
arr

array([0.19053665, 0.65730881, 0.08112387])

In [4]:
arr_series = pd.Series(arr)

Now we have a Series object, which you can display in a notebook by typing its name:

In [5]:
arr_series

0    0.190537
1    0.657309
2    0.081124
dtype: float64

Now let's create a pandas DataFrame in a similar way. Most of the data structures you work with will likely be DataFrames.


In [6]:
arr_1 = np.random.rand(3,4)

In [7]:
arr_1

array([[0.67845183, 0.01530942, 0.93672697, 0.83351465],
       [0.837926  , 0.14726245, 0.06647807, 0.85119339],
       [0.11725869, 0.13811105, 0.34015303, 0.59700616]])

In [8]:
arr_1_dataframe = pd.DataFrame(arr_1)

We now have our DataFrame. Let's display it:


In [9]:
arr_1_dataframe

Unnamed: 0,0,1,2,3
0,0.678452,0.015309,0.936727,0.833515
1,0.837926,0.147262,0.066478,0.851193
2,0.117259,0.138111,0.340153,0.597006


DataFrames have a variety of built-in attributes. For the sake of explanation, we will make a new DataFrame using the original one-dimensional NumPy array:


In [10]:
arr_dataframe = pd.DataFrame(arr)

In [11]:
arr_dataframe

Unnamed: 0,0
0,0.190537
1,0.657309
2,0.081124


df.dtypes gives the data type of each column in the DataFrame:


In [15]:
arr_dataframe.dtypes

0    float64
dtype: object
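
When a DataFrame has several columns, `dtypes` reports one entry per column. The sketch below uses a hypothetical frame named `mixed` to show this, and uses `astype` to convert a column. The exact integer dtype printed can depend on the platform.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column
mixed = pd.DataFrame({"count": [1, 2, 3], "price": [0.5, 1.25, 2.0]})

print(mixed.dtypes)  # one dtype per column

# astype returns a converted copy; the original frame is unchanged
as_float = mixed["count"].astype("float64")
print(as_float.dtype)
```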

df.index refers to the row labels of the DataFrame:

In [14]:
arr_dataframe.index

RangeIndex(start=0, stop=3, step=1)

df.columns refers to the column labels of the DataFrame:

In [16]:
arr_dataframe.columns

RangeIndex(start=0, stop=1, step=1)

Note that we can change these attributes by assigning to them:

In [20]:
arr_dataframe.columns = ["Value"]

In [23]:
arr_dataframe

Unnamed: 0,Value
0,0.190537
1,0.657309
2,0.081124
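
The row labels can be replaced in the same way as the column labels. The sketch below, using a hypothetical frame named `frame`, also shows `rename`, which returns a new DataFrame and only touches the labels named in the mapping.

```python
import pandas as pd

frame = pd.DataFrame([0.19, 0.66, 0.08])  # hypothetical values

# Assigning to the attributes replaces all labels at once
frame.columns = ["Value"]
frame.index = ["first", "second", "third"]

# rename returns a new frame and leaves the original untouched
renamed = frame.rename(columns={"Value": "Score"})

print(frame.loc["second", "Value"])
print(list(renamed.columns))
```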


df.values returns the underlying data as a NumPy array (we rarely use it directly, but it is good to know):

In [22]:
arr_dataframe.values

array([[0.19053665],
       [0.65730881],
       [0.08112387]])
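
The pandas documentation recommends the `to_numpy()` method over `.values` for getting the data as an array. A minimal sketch with a hypothetical frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Value": [0.19, 0.66, 0.08]})  # hypothetical values

as_array = frame.to_numpy()  # 2-D array, one row per DataFrame row
print(type(as_array), as_array.shape)

column_array = frame["Value"].to_numpy()  # 1-D array for a single column
print(column_array.shape)
```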

df.shape gives the dimensions of the DataFrame as (rows, columns):

In [None]:
arr_dataframe.shape

(3, 1)

## Reading CSV Files

Data from https://www.kaggle.com/spscientist/students-performance-in-exams

In [None]:
df = pd.read_csv('data/StudentsPerformance.csv')

In [None]:
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


## Pickle

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html

In [None]:
df.to_pickle('data/StudentsPerformance.pickle')
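
A pickled DataFrame can be loaded back with `pd.read_pickle`. The sketch below round-trips a small hypothetical frame through a temporary directory so it does not depend on the `data/` folder. The file name `example.pickle` is made up for illustration. Only unpickle files from sources you trust.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for the student data
small = pd.DataFrame({"gender": ["female", "male"], "math score": [72, 47]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pickle")  # hypothetical file name
    small.to_pickle(path)            # write the frame to disk
    restored = pd.read_pickle(path)  # read it back, dtypes included

print(restored.equals(small))
```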

In [None]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [None]:
df = pd.read_csv('data/StudentsPerformance.csv', nrows=10)

In [None]:
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
5,female,group B,associate's degree,standard,none,71,83,78
6,female,group B,some college,standard,completed,88,95,92
7,male,group B,some college,free/reduced,none,40,43,39
8,male,group D,high school,free/reduced,completed,64,64,67
9,female,group B,high school,free/reduced,none,38,60,50


In [None]:
df = pd.read_csv('data/StudentsPerformance.csv', nrows=10, usecols=['gender', 'math score', 'reading score', 'writing score'])

In [None]:
df

Unnamed: 0,gender,math score,reading score,writing score
0,female,72,72,74
1,female,69,90,88
2,female,90,95,93
3,male,47,57,44
4,male,76,78,75
5,female,71,83,78
6,female,88,95,92
7,male,40,43,39
8,male,64,64,67
9,female,38,60,50
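
The `nrows` and `usecols` arguments used above can be tried without the data file. The sketch below reads from an in-memory string instead. The CSV text is a made-up stand-in, and `subset` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """gender,lunch,math score,reading score
female,standard,72,72
female,standard,69,90
male,free/reduced,47,57
male,standard,76,78
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    nrows=3,                           # parse only the first 3 data rows
    usecols=["gender", "math score"],  # keep only these columns
)

print(subset.shape)
print(list(subset.columns))
```

Limiting rows and columns at read time is useful for previewing a large file, since the skipped data is never loaded into the DataFrame.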


In [None]:
#data https://www.kaggle.com/jpbulman/usa-dunkin-donuts-stores?select=dunkin.py

## Reading JSON Files

In [None]:
import json

# Use a context manager so the file is closed after loading
with open('data/dunkinDonuts.json') as f:
    record = json.load(f)
print(type(record))
print(record.keys())
print(len(record['data']))

In [None]:
dunkinKeys = ['address','address2','city','phonenumber','county','country','sat_hours','sun_hours','distance']

In [None]:

dunkinRecords = []

# Build one row per store, keeping only the fields listed in dunkinKeys
for dunkinDict in record['data']:
    dunkinRecords.append([dunkinDict[key] for key in dunkinKeys])

In [None]:
import pandas as pd

df = pd.DataFrame.from_records(dunkinRecords, columns=dunkinKeys)
df.head()

In [None]:
df = pd.DataFrame(dunkinRecords, columns=dunkinKeys)
df
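
As an alternative to building the rows by hand, pandas can also take a list of dictionaries directly, after which the wanted columns can be selected by name. This is a sketch only: the records below are invented and merely assumed to be shaped like the entries of `record['data']`, and `stores`, `wanted`, and `stores_df` are hypothetical names.

```python
import pandas as pd

# Hypothetical records, assumed to resemble the entries of record['data']
stores = [
    {"address": "1 Main St", "city": "Boston", "distance": 0.4, "extra": "x"},
    {"address": "2 Elm St", "city": "Salem", "distance": 1.2, "extra": "y"},
]
wanted = ["address", "city", "distance"]

# Each dict becomes a row; indexing with a list keeps only those columns
stores_df = pd.DataFrame(stores)[wanted]
print(stores_df)
```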

## For more resources: https://pandas.pydata.org/docs/user_guide/index.html