Here, we use the `pandas` package to redo `readwrite.ipynb`. First, open a Terminal from JupyterLab and run `which pip` to check whether `pip` is available. We didn't have it installed, so we ran `conda install pip`. If `pip` is already present, this command reports `All requested packages already installed.`

Next, run `pip install pandas`, which installs `pandas` along with `numpy` (a dependency of `pandas`). Both packages can then be imported.
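
As a quick check that the installation worked, the following minimal sketch imports both packages and prints their versions. The version numbers depend on your environment, so none are shown here.

```python
# Sanity check: both imports should succeed once the installation has finished.
import numpy as np
import pandas as pd

print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
```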

In [1]:
import os
import pandas as pd # conventional abbreviation; numpy is likewise imported as np
import numpy as np

In [2]:
csvPathFile = os.path.join(os.getcwd(), 'roster.csv')
print(csvPathFile) # Make sure it's the right file

C:\Users\austi\Documents\Python\python_workshop\roster.csv
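
Printing the path shows where Python will look, but it does not confirm that the file is actually there. A minimal sketch of an explicit check, using `os.path.isfile` from the standard library (the variable name `csv_path` is only for illustration):

```python
import os

csv_path = os.path.join(os.getcwd(), 'roster.csv')

# isfile returns True only if the path exists and is a regular file
if os.path.isfile(csv_path):
    print('Found:', csv_path)
else:
    print('No roster.csv in', os.getcwd())
```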


### Read CSV to DataFrame

In [3]:
roster = pd.read_csv(csvPathFile)
print(type(roster)) # Not needed, just tells us we did the right thing

<class 'pandas.core.frame.DataFrame'>
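
If you do not have `roster.csv` at hand, the same call can be tried on an in-memory stand-in. This is only a sketch: the three-name `csv_text` below is a hypothetical substitute for the real file, assuming it has a single `name` column.

```python
import io

import pandas as pd

# Hypothetical stand-in for roster.csv: a header line followed by one name per row
csv_text = "name\nJoe\nJihuan\nAli\n"

# read_csv accepts file-like objects as well as file paths
demo = pd.read_csv(io.StringIO(csv_text))

print(type(demo))
print(demo.columns.tolist())  # ['name']
print(len(demo))              # 3
```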


#### Viewing the data 

In [4]:
roster.head() # Shows us first five rows of data frame corresponding to first five rows of csv file

Unnamed: 0,name
0,Joe
1,Jihuan
2,Ali
3,Frances
4,Daniela V


In [5]:
roster.tail() # Shows us last five rows

Unnamed: 0,name
17,Hsin-Yun
18,Renata
19,Max
20,Joshua
21,David


In [6]:
roster # shows entire data frame

Unnamed: 0,name
0,Joe
1,Jihuan
2,Ali
3,Frances
4,Daniela V
5,Mostafa
6,Daniela P
7,Cesar
8,Jarrod
9,Austin
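
Both `head` and `tail` take an optional row count, and `shape` reports the size of the data frame without printing it. A small sketch on a made-up six-row frame (`demo` is not the workshop roster):

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Joe', 'Jihuan', 'Ali', 'Frances', 'Daniela V', 'Mostafa']})

print(demo.shape)    # (6, 1): number of rows, number of columns
print(demo.head(2))  # first two rows instead of the default five
print(demo.tail(3))  # last three rows
```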


### Modifying the Data

In [7]:
d =  {'name': ['Wally']} # adding student to dataframe, could add multiple students ['Wally', 'Joe']
tmp_df = pd.DataFrame(data=d) # Documentation will tell you to create another data frame and merge together
tmp_df # just consists of one row and one column of Wally

Unnamed: 0,name
0,Wally


In [8]:
d =  {'name': ['Wally']} # adding student to dataframe, could add multiple students ['Wally', 'Joe']
tmp_df = pd.DataFrame(data=d) # Documentation will tell you to create another data frame and merge together
roster = pd.concat([roster, tmp_df], ignore_index=True) # append the new row and renumber the index
roster

Unnamed: 0,name
0,Joe
1,Jihuan
2,Ali
3,Frances
4,Daniela V
5,Mostafa
6,Daniela P
7,Cesar
8,Jarrod
9,Austin
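
To see what `ignore_index=True` changes, the sketch below concatenates two small frames with and without it. The frames and the extra student `'Ana'` are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Joe', 'Jihuan']})
new_rows = pd.DataFrame({'name': ['Wally', 'Ana']})  # 'Ana' is a made-up extra student

kept = pd.concat([demo, new_rows])                            # each frame keeps its own row labels
renumbered = pd.concat([demo, new_rows], ignore_index=True)   # labels are rebuilt as 0..n-1

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Without `ignore_index=True` the row labels repeat, which makes later label-based selection with `.loc` ambiguous.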


#### Assign grades

In [9]:
import random
roster['grade'] = random.randint(0,100) # returns a single value, which is copied to every row (not what we want)
roster

Unnamed: 0,name,grade
0,Joe,9
1,Jihuan,9
2,Ali,9
3,Frances,9
4,Daniela V,9
5,Mostafa,9
6,Daniela P,9
7,Cesar,9
8,Jarrod,9
9,Austin,9


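Instead, we can use `numpy` to draw one random grade per row with very little code. Note that `np.random.randint(0, 100)` excludes the upper bound, so grades run from 0 to 99, whereas `random.randint(0, 100)` includes 100.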

In [10]:
np.random.seed(1)
roster['grade'] = np.random.randint(0,100,size=len(roster)) # grades assigned at random for every individual in this class
roster

Unnamed: 0,name,grade
0,Joe,37
1,Jihuan,12
2,Ali,72
3,Frances,9
4,Daniela V,75
5,Mostafa,5
6,Daniela P,79
7,Cesar,64
8,Jarrod,16
9,Austin,1
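
NumPy also offers a newer random interface, `np.random.default_rng`, whose `integers` method can include the upper bound. The sketch below is an alternative to the cell above, not what the workshop used. The class size `n_students` is a made-up value, and the numbers drawn will differ from those produced by `np.random.seed(1)`.

```python
import numpy as np

n_students = 5  # hypothetical class size for this demo

rng = np.random.default_rng(1)  # seeded generator, so the draw is reproducible
grades = rng.integers(0, 100, size=n_students, endpoint=True)  # endpoint=True makes 100 possible

# A second generator with the same seed reproduces the same draw
rng_again = np.random.default_rng(1)
same_grades = rng_again.integers(0, 100, size=n_students, endpoint=True)

print(grades)
print((grades == same_grades).all())  # True
```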


What if we want to modify one row in our data? `.loc` accepts a boolean Series (one `True` or `False` per row) and selects the rows where the value is `True`.

In [11]:
roster['name'] == "Daniela P"

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
Name: name, dtype: bool

In [12]:
roster.loc[roster['name'] == "Daniela P", 'grade'] = 100
roster

Unnamed: 0,name,grade
0,Joe,37
1,Jihuan,12
2,Ali,72
3,Frances,9
4,Daniela V,75
5,Mostafa,5
6,Daniela P,100
7,Cesar,64
8,Jarrod,16
9,Austin,1
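
The same pattern on a three-row frame makes the two steps easier to follow: build the mask, then pass it to `.loc` together with the column label. The frame `demo` is a made-up subset of the roster.

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Joe', 'Daniela P', 'Ali'], 'grade': [37, 79, 72]})

mask = demo['name'] == 'Daniela P'  # boolean Series: False, True, False
print(mask.sum())                   # 1, the number of matching rows

demo.loc[mask, 'grade'] = 100       # rows chosen by the mask, column chosen by label
print(demo['grade'].tolist())       # [37, 100, 72]
```

Checking `mask.sum()` before assigning is a cheap way to catch a misspelled name, which would otherwise match zero rows and change nothing.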


### Check the Class Average

Each column in a `pandas` DataFrame is a `Series` object, which has dozens of built-in methods.

In [13]:
roster['grade'].mean()

37.95652173913044
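
A few other Series methods, sketched on a standalone Series built from the first five grades in the output of `In [10]` (the full roster is not reproduced here):

```python
import pandas as pd

grades = pd.Series([37, 12, 72, 9, 75], name='grade')  # first five grades from In [10]

print(grades.mean())               # 41.0
print(grades.median())             # 37.0
print(grades.min(), grades.max())  # 9 75
print(grades.describe())           # count, mean, std, min, quartiles, max in one call
```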

In [14]:
# Add 40 points to everyone with a grade below 50 (the right-hand side is aligned on the index, so only the selected rows change)
roster.loc[roster['grade'] < 50, 'grade'] = roster['grade'] + 40
roster

Unnamed: 0,name,grade
0,Joe,77
1,Jihuan,52
2,Ali,72
3,Frances,49
4,Daniela V,75
5,Mostafa,45
6,Daniela P,100
7,Cesar,64
8,Jarrod,56
9,Austin,41


In [15]:
roster['grade'].mean()

62.30434782608695
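
An equivalent way to write the curve is to store the mask first and use `+=`. This is a sketch on a made-up three-row frame, not a replacement for the cell above.

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Joe', 'Ali', 'Frances'], 'grade': [37, 72, 9]})

below_50 = demo['grade'] < 50      # compute the mask once, before any grades change
demo.loc[below_50, 'grade'] += 40  # only the selected rows are updated

print(demo['grade'].tolist())  # [77, 72, 49]
```

Storing the mask in a variable also makes it easy to count or inspect the affected rows before changing them.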

In [17]:
# Scale all grades by a common factor so that the new mean is 70 (this multiplies rather than adds, so a grade can end up above 100)
roster.loc[roster['grade'] > 0, 'grade'] = roster['grade']*(70/(roster['grade'].mean()))
roster['grade'].mean()

70.00000000000001

In [18]:
roster

Unnamed: 0,name,grade
0,Joe,86.510816
1,Jihuan,58.422889
2,Ali,80.893231
3,Frances,55.052338
4,Daniela V,84.263782
5,Mostafa,50.558269
6,Daniela P,112.35171
7,Cesar,71.905094
8,Jarrod,62.916957
9,Austin,46.064201
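
Multiplying every grade by `70 / mean` moves the mean to 70, but it can push high grades past 100, as happens to Daniela P above. The sketch below shows the scaling on four made-up grades and then caps the result with `Series.clip`. The cap is an assumption added for illustration, not part of the original workshop, and it pulls the mean back below the target.

```python
import pandas as pd

grades = pd.Series([60.0, 40.0, 50.0, 100.0], name='grade')  # made-up grades, mean 62.5

target_mean = 70
factor = target_mean / grades.mean()  # 70 / 62.5 = 1.12
scaled = grades * factor              # mean is now 70, up to floating-point rounding

capped = scaled.clip(upper=100)       # assumption: no grade may exceed 100

print(round(scaled.mean(), 6))
print(scaled.max() > 100)             # True: the top grade overshoots
print(capped.mean() < scaled.mean())  # True: capping lowers the mean
```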


### Write to CSV

In [21]:
outFilePath = os.path.join(os.getcwd(), 'roster_pandas.csv')
print(outFilePath)

C:\Users\austi\Documents\Python\python_workshop\roster_pandas.csv


In [22]:
roster.to_csv(outFilePath, index=False)
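
To see what `index=False` does, the sketch below writes a tiny frame to a temporary directory and reads it back. The file name `roster_demo.csv` and the two rows are made up, and nothing is written to your working directory.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'name': ['Joe', 'Ali'], 'grade': [86.5, 80.9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'roster_demo.csv')  # hypothetical file name
    demo.to_csv(out_path, index=False)  # index=False: do not write the row labels
    with open(out_path) as f:
        header = f.readline().strip()
    round_trip = pd.read_csv(out_path)

print(header)                       # name,grade
print(round_trip.columns.tolist())  # ['name', 'grade']
```

Without `index=False`, the row labels are written as an extra unnamed first column, which `read_csv` then loads as a column called `Unnamed: 0`.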

### More Aggregation and Manipulation

In [23]:
np.random.choice(['red', 'blue'], size=len(roster))

array(['red', 'blue', 'blue', 'blue', 'blue', 'blue', 'red', 'red', 'red',
       'blue', 'blue', 'blue', 'blue', 'blue', 'blue', 'red', 'blue',
       'blue', 'red', 'red', 'blue', 'red', 'red'], dtype='<U4')

In [25]:
np.random.seed(2) # seed will determine assignment
roster['group'] = np.random.choice(['red', 'blue'], size=len(roster))
roster

Unnamed: 0,name,grade,group
0,Joe,86.510816,red
1,Jihuan,58.422889,blue
2,Ali,80.893231,blue
3,Frances,55.052338,red
4,Daniela V,84.263782,red
5,Mostafa,50.558269,blue
6,Daniela P,112.35171,red
7,Cesar,71.905094,blue
8,Jarrod,62.916957,red
9,Austin,46.064201,blue
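
Random assignment rarely splits a class evenly, so it can be worth counting the group sizes. A minimal sketch using `value_counts` on a standalone Series of ten made-up assignments (the counts are not shown because they depend on the random draw):

```python
import numpy as np
import pandas as pd

np.random.seed(2)
groups = pd.Series(np.random.choice(['red', 'blue'], size=10), name='group')

counts = groups.value_counts()  # number of rows per distinct value
print(counts)
```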


In [26]:
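roster.groupby(by=['group']).mean(numeric_only=True) # numeric_only=True skips the text 'name' column; pandas 2.0 and later raise an error without it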

group,grade
blue,65.509689
red,75.837404
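
`groupby` can also compute several statistics at once by selecting a column and calling `agg`. The sketch below uses a made-up four-row frame rather than the workshop roster.

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['Joe', 'Jihuan', 'Ali', 'Frances'],
    'grade': [80.0, 60.0, 70.0, 90.0],
    'group': ['red', 'blue', 'blue', 'red'],
})

# Select the numeric column first, then ask for several summaries per group
summary = demo.groupby('group')['grade'].agg(['mean', 'max', 'count'])
print(summary)
```

Selecting `'grade'` before aggregating avoids any attempt to average the text `name` column.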
