# pandas Lesson

This is a **Jupyter Notebook**, a common programming application used for data science.

This is a **Markdown** cell (Cell > Cell Type > Markdown). It is used for formatting text to make the document more accessible.

1. Link to [Class Notes: 7-pandas](https://docs.google.com/document/d/19L1LDqZqwkQqtb6CUZjj5LzNL2aJoRYPrl6svX4Uvdo/edit)
2. Link to [pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
3. Link to [Markdown Guide](https://www.markdownguide.org/cheat-sheet/)

In this lesson, we introduce the `pandas` library and `DataFrame`s.


In [1]:
# Import necessary modules
import pandas as pd
import numpy as np

In [2]:
# Generate a reproducible random DataFrame of integer grades from 60 to 99
# (randint's upper bound, 100, is exclusive)
np.random.seed(1)
df = pd.DataFrame(np.random.randint(60,100,(4, 7)), columns=list('abcdefg'))

In [3]:
# View the DataFrame (representing assignment grades for Students a-g)
# Jupyter Notebook automatically displays the value of the last expression in a cell
df

    a   b   c   d   e   f   g
0  97  72  68  69  71  65  75
1  60  76  61  72  67  66  85
2  80  97  78  80  71  88  89
3  74  64  83  83  90  92  82


In [4]:
# Select Student a's column
df['a']

0    97
1    60
2    80
3    74
Name: a, dtype: int64

In [5]:
# Select only the first grade (index label 0) in Student a's column
df['a'][0]

97
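
Single values can also be looked up in one step, either by label or by position. The sketch below rebuilds the grades table from the values displayed earlier (rather than regenerating it randomly) and compares `.loc`, `.at`, and `.iloc`. The variable names are illustrative only.

```python
import pandas as pd

# Grades copied from the DataFrame displayed earlier (students a-g, 4 assignments)
df = pd.DataFrame(
    [[97, 72, 68, 69, 71, 65, 75],
     [60, 76, 61, 72, 67, 66, 85],
     [80, 97, 78, 80, 71, 88, 89],
     [74, 64, 83, 83, 90, 92, 82]],
    columns=list("abcdefg"),
)

by_label = df.loc[0, "a"]    # row labelled 0, column labelled "a"
by_scalar = df.at[0, "a"]    # same lookup, intended for a single value
by_position = df.iloc[0, 0]  # first row, first column, regardless of labels

# Labels and positions diverge once rows are dropped
later_rows = df.iloc[2:]                    # keeps the rows labelled 2 and 3
first_remaining = later_rows["a"].iloc[0]   # position 0 now holds label 2

print(by_label, by_scalar, by_position, first_remaining)
```

Position-based access with `.iloc` keeps working after rows are removed, whereas a lookup by label `0` would fail on `later_rows` because that label no longer exists.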

In [6]:
# Calculate the mean for Student 'a'
df['a'].mean()

77.75
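
The same `mean()` call works on the whole DataFrame. As a minimal sketch, assuming the grades shown earlier are entered by hand, `axis=0` (the default) averages down each student's column and `axis=1` averages across each assignment's row. The names `per_student` and `per_assignment` are illustrative.

```python
import pandas as pd

# Grades copied from the DataFrame displayed earlier
df = pd.DataFrame(
    [[97, 72, 68, 69, 71, 65, 75],
     [60, 76, 61, 72, 67, 66, 85],
     [80, 97, 78, 80, 71, 88, 89],
     [74, 64, 83, 83, 90, 92, 82]],
    columns=list("abcdefg"),
)

per_student = df.mean()            # one mean per column (student)
per_assignment = df.mean(axis=1)   # one mean per row (assignment)

print(per_student)
print(per_assignment)
```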

In [7]:
# Show summary statistics for the whole DataFrame
df.describe()

               a          b         c         d          e          f         g
count        4.0        4.0       4.0       4.0        4.0        4.0       4.0
mean       77.75      77.25      72.5      76.0      74.75      77.75     82.75
std    15.326991  14.080128  9.882645  6.582806  10.340052  14.244882  5.909033
min         60.0       64.0      61.0      69.0       67.0       65.0      75.0
25%         70.5       70.0     66.25     71.25       70.0      65.75     80.25
50%         77.0       74.0      73.0      76.0       71.0       77.0      83.5
75%        84.25      81.25     79.25     80.75      75.75       89.0      86.0
max         97.0       97.0      83.0      83.0       90.0       92.0      89.0


In [8]:
# Add a column to the DataFrame
df['Type'] = ['HW', 'HW', 'Test', 'Test']
df

    a   b   c   d   e   f   g  Type
0  97  72  68  69  71  65  75    HW
1  60  76  61  72  67  66  85    HW
2  80  97  78  80  71  88  89  Test
3  74  64  83  83  90  92  82  Test
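
Assigning to `df['Type']` modifies `df` in place. When the original should stay untouched, `DataFrame.assign` returns a new DataFrame with the extra column instead. The sketch below assumes the same hand-entered grades. `labeled` is an illustrative name.

```python
import pandas as pd

# Grades copied from the DataFrame displayed earlier
df = pd.DataFrame(
    [[97, 72, 68, 69, 71, 65, 75],
     [60, 76, 61, 72, 67, 66, 85],
     [80, 97, 78, 80, 71, 88, 89],
     [74, 64, 83, 83, 90, 92, 82]],
    columns=list("abcdefg"),
)

# assign() returns a copy with the new column; df itself is not changed
labeled = df.assign(Type=["HW", "HW", "Test", "Test"])

print(labeled)
print("Type" in df.columns)  # the original still has only columns a-g
```

With either approach, the list of labels needs one entry per row. A list of the wrong length raises a `ValueError`.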


In [9]:
# Filter the DataFrame for 'Test' rows
# Note: filtering returns a new DataFrame; assigning it back to df overwrites the original
df = df[df['Type'] == 'Test']
df

    a   b   c   d   e   f   g  Type
2  80  97  78  80  71  88  89  Test
3  74  64  83  83  90  92  82  Test
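
The comparison `df['Type'] == 'Test'` produces a Boolean Series, and indexing with it keeps only the rows marked `True`. A minimal sketch, assuming the grades shown earlier, stores each filtered result under a new name so that the full table remains available. The names `is_test`, `tests`, `homework`, and `strong_tests` are illustrative.

```python
import pandas as pd

# Grades copied from the DataFrame displayed earlier, plus the Type column
df = pd.DataFrame(
    [[97, 72, 68, 69, 71, 65, 75],
     [60, 76, 61, 72, 67, 66, 85],
     [80, 97, 78, 80, 71, 88, 89],
     [74, 64, 83, 83, 90, 92, 82]],
    columns=list("abcdefg"),
)
df["Type"] = ["HW", "HW", "Test", "Test"]

is_test = df["Type"] == "Test"   # Boolean Series: False, False, True, True
tests = df[is_test]              # rows where the mask is True
homework = df[~is_test]          # ~ inverts the mask

# Combine conditions with & (and) or | (or); each condition needs parentheses
strong_tests = df[(df["Type"] == "Test") & (df["a"] >= 80)]

print(tests)
print(homework)
print(strong_tests)
```

Filtered results keep their original index labels, so `tests` is indexed by 2 and 3 rather than 0 and 1.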


In [10]:
# View summary statistics for the filtered df (Test rows only)
df.describe()

              a          b         c        d          e         f         g
count       2.0        2.0       2.0      2.0        2.0       2.0       2.0
mean       77.0       80.5      80.5     81.5       80.5      90.0      85.5
std    4.242641  23.334524  3.535534  2.12132  13.435029  2.828427  4.949747
min        74.0       64.0      78.0     80.0       71.0      88.0      82.0
25%        75.5      72.25     79.25    80.75      75.75      89.0     83.75
50%        77.0       80.5      80.5     81.5       80.5      90.0      85.5
75%        78.5      88.75     81.75    82.25      85.25      91.0     87.25
max        80.0       97.0      83.0     83.0       90.0      92.0      89.0
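
The `Type` column is missing from this summary because `describe()` reports only numeric columns by default when a DataFrame mixes data types. As a sketch, assuming the two Test rows shown earlier are entered by hand, calling `describe()` on the text column itself returns a different set of statistics. The variable names are illustrative.

```python
import pandas as pd

# The two Test rows from the filtered DataFrame, with their original labels
tests = pd.DataFrame(
    [[80, 97, 78, 80, 71, 88, 89, "Test"],
     [74, 64, 83, 83, 90, 92, 82, "Test"]],
    columns=list("abcdefg") + ["Type"],
    index=[2, 3],
)

numeric_summary = tests.describe()        # numeric columns only
type_summary = tests["Type"].describe()   # count, unique, top, freq

print(numeric_summary.columns.tolist())
print(type_summary)
```

Passing `include='all'` to `describe()` would summarize both kinds of column in a single table, filling statistics that do not apply with `NaN`.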
