# Cleaning Data / Advanced Pandas
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo29_live.ipynb) instead. 
If you would rather follow along by just executing the cells, stay in this notebook.

In [None]:
import numpy as np
import pandas as pd

## Reshaping Data

In [None]:
df_weather_wide = pd.read_csv('sample_weather.csv')
df_weather_wide = df_weather_wide.iloc[:,1:]
df_weather_wide

In [None]:
# change wide format data into long format
long_weather = df_weather_wide.melt(id_vars = 'date',value_vars = ['max_temp','min_temp','inches_of_rain'])
long_weather
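To see what `melt` does without loading the CSV, here is a minimal sketch on a made-up two-day table. The frame `wide_toy` and its values are hypothetical; `var_name` and `value_name` are optional `melt` arguments that rename the default `variable` and `value` columns.

```python
import pandas as pd

# Hypothetical two-day table; the values are invented for illustration
wide_toy = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'max_temp': [55, 58],
    'min_temp': [41, 44],
})

# var_name / value_name replace the default 'variable' / 'value' column names
long_toy = wide_toy.melt(id_vars='date',
                         value_vars=['max_temp', 'min_temp'],
                         var_name='measurement',
                         value_name='reading')

# Each wide row becomes one long row per melted column: 2 dates x 2 columns = 4 rows
print(long_toy)
```

The melted columns are stacked one after another, so all `max_temp` rows come before the `min_temp` rows.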

In [None]:
# value_vars can list just a subset of the columns if needed
df_weather_wide.melt(id_vars = 'date',value_vars = ['max_temp','min_temp'])

In [None]:
# Note that there is only one entry per date/variable combination
pd.crosstab(index = long_weather.date,columns=long_weather.variable)

In [None]:
# change long format back into wide format
long_weather.pivot(index = 'date',columns = 'variable',values='value')

### What to do when there are multiple values per category

In [None]:
# A new long dataframe
long_df = pd.read_csv('long_data.csv')
long_df = long_df.iloc[:,1:]
long_df.head()

In [None]:
# check the number of entries for each combination of date/category
pd.crosstab(index=long_df.date,columns=long_df.category)

In [None]:
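# pivot raises a ValueError here: some date/category pairs have several rows,
# and pivot has no rule for combining them into a single cell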
long_df.pivot(index='date', columns='category', values='sales')

In [None]:
# Use pivot table instead to get the average sales by date and category
long_df.pivot_table(index='date', columns='category', values='sales')
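The difference between `pivot` and `pivot_table` can be reproduced on a tiny frame. This is a sketch under the assumption that the duplicates look like the ones below; `toy` and its values are hypothetical and are not taken from `long_data.csv`.

```python
import pandas as pd

# Hypothetical data: two 'Books' rows share the same date
toy = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'category': ['Books', 'Books', 'Books'],
    'sales': [10, 30, 5],
})

# pivot refuses to reshape because one date/category cell would need two values
try:
    toy.pivot(index='date', columns='category', values='sales')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot raised:', err)

# pivot_table aggregates the duplicates instead (the default aggfunc is the mean)
averaged = toy.pivot_table(index='date', columns='category', values='sales')
print(averaged)
```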

In [None]:
# You can also change the aggregation function; e.g. TOTAL sales by date/category
# (the string 'sum' avoids the FutureWarning newer pandas gives for the builtin sum)
wide_df = long_df.pivot_table(index='date', columns='category', values='sales', aggfunc='sum')
wide_df

In [None]:
# Can also use it like crosstab if you choose len as the aggfunc
long_df.pivot_table(index=['date'], columns='category', values=['sales'], aggfunc=len)
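One difference from `crosstab` is worth a quick check: for a date/category pair with no rows, `crosstab` reports 0, while `pivot_table` leaves the cell missing unless `fill_value` is given. A small sketch, using a hypothetical frame `toy_counts`:

```python
import pandas as pd

# Hypothetical data: 'd1' has no Clothing rows and 'd2' has no Books rows
toy_counts = pd.DataFrame({
    'date': ['d1', 'd1', 'd2'],
    'category': ['Books', 'Books', 'Clothing'],
    'sales': [10, 30, 5],
})

# crosstab counts rows and fills absent combinations with 0
counts_crosstab = pd.crosstab(index=toy_counts['date'], columns=toy_counts['category'])

# pivot_table with len counts rows too; fill_value=0 fills the absent combinations
counts_pivot = toy_counts.pivot_table(index='date', columns='category',
                                      values='sales', aggfunc=len, fill_value=0)

print(counts_crosstab)
print(counts_pivot)
```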

In [None]:
# back to a longer format (note that this only has total sales)
wide_df.reset_index().melt(id_vars='date', value_vars=['Books','Clothing','Electronics'])

In [None]:
# You can also choose multiple columns (this produces MultiIndex column labels)
wide_df2 = long_df.pivot_table(index='date', columns=['category','product'], values='sales', aggfunc='sum')
wide_df2

In [None]:
# Rename columns and reset index to work with it as you normally would
wide_df2.columns = list(map("_".join, wide_df2.columns))
wide_df2.reset_index()
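The renaming step works because each label in a two-level column index is a tuple such as `('Books', 'Novel')`, and `"_".join` glues the two strings together. A minimal sketch on hypothetical data (`toy_multi` and its product names are invented):

```python
import pandas as pd

# Hypothetical sales rows with a category and a product per row
toy_multi = pd.DataFrame({
    'date': ['d1', 'd1', 'd2'],
    'category': ['Books', 'Clothing', 'Books'],
    'product': ['Novel', 'Shirt', 'Novel'],
    'sales': [10, 30, 5],
})

wide_multi = toy_multi.pivot_table(index='date', columns=['category', 'product'],
                                   values='sales', aggfunc='sum')

# Each column label is a tuple like ('Books', 'Novel'); join the parts with '_'
wide_multi.columns = ['_'.join(pair) for pair in wide_multi.columns]

# Move 'date' from the index back into an ordinary column
flat = wide_multi.reset_index()
print(flat)
```

Note that `reset_index()` returns a new frame, so the result needs to be assigned if you want to keep it.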

## Combining Data

In [None]:
# Create the first dataframe
df1 = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Luke Danes', 'Emily Gilmore'],
    'Occupation': ['Manager', 'Student', 'Owner', 'Socialite'],
    'Age': [32, 20, 40, 60]
})

# Create the second dataframe
df2 = pd.DataFrame({
    'Name': ['Lorelai Gilmore', 'Rory Gilmore', 'Sookie St. James', 'Richard Gilmore'],
    'Home': ['Stars Hollow', 'Stars Hollow', 'Stars Hollow', 'Hartford']
})

In [None]:
df1

In [None]:
df2

### Merge

In [None]:
# A standard merge (inner)
df1.merge(df2)

In [None]:
# Explicitly specifying what to merge by (same as before)
df1.merge(df2, on = 'Name')

In [None]:
# What if they had different column names?
df2.rename(columns = {'Name':'Character Name'},inplace=True)
df2

In [None]:
# Use left_on / right_on to name the key column in each dataframe
df1.merge(df2, left_on = 'Name',right_on = 'Character Name')

In [None]:
# Can drop the redundant column
df1.merge(df2, left_on = 'Name',right_on = 'Character Name').drop(columns = 'Character Name')

In [None]:
# Reset it back to original
df2.rename(columns = {'Character Name':'Name'},inplace=True)

In [None]:
# Outer
df1.merge(df2, how = 'outer')

In [None]:
# Left
df1.merge(df2,how = 'left')

In [None]:
# Right
df1.merge(df2,how = 'right')

In [None]:
# Cross join (not super common, but occasionally handy)
df1.merge(df2,how='cross')
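A quick way to compare the `how` options is to count the rows each one returns. The sketch below uses two hypothetical frames, `left_toy` and `right_toy`, that share the keys `b` and `c`.

```python
import pandas as pd

# Hypothetical frames: keys 'b' and 'c' appear in both
left_toy = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right_toy = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Number of rows produced by each kind of merge on 'key'
row_counts = {
    how: len(left_toy.merge(right_toy, on='key', how=how))
    for how in ['inner', 'outer', 'left', 'right']
}

# A cross join takes no key: every left row is paired with every right row
row_counts['cross'] = len(left_toy.merge(right_toy, how='cross'))

print(row_counts)
# inner: 2 (b, c), outer: 4 (a, b, c, d), left: 3, right: 3, cross: 3 * 3 = 9
```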

In [None]:
# What if there are two common columns?
df1['School'] = ['Hartford Community College','Yale','Stars Hollow High','Smith College']
df2['School'] = ['Hartford Community College','Yale','Unknown','Yale']

In [None]:
# Test what a standard merge does
df1.merge(df2)

In [None]:
# Outer merge keeps every row from both dataframes, matching on all shared columns (Name and School)
df1.merge(df2, how = 'outer')

In [None]:
# What if we only specify one column to merge on? (the other shared column gets _x / _y suffixes)
df1.merge(df2, on = 'Name',how = 'outer')

In [None]:
# Merge on both
df1.merge(df2, on = ['Name','School'],how = 'outer')
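The effect of merging on one shared column versus both can be checked on a smaller pair of frames. This is a sketch with hypothetical names and schools; `suffixes` is the `merge` argument that replaces the default `_x` / `_y` labels.

```python
import pandas as pd

# Hypothetical frames that share both 'Name' and 'School'
left_people = pd.DataFrame({'Name': ['Ann', 'Ben'], 'School': ['X', 'Y']})
right_people = pd.DataFrame({'Name': ['Ann', 'Ben'], 'School': ['X', 'Z']})

# Merging on Name only: both School columns are kept, told apart by suffixes
one_key = left_people.merge(right_people, on='Name')

# The suffixes can be customized
custom = left_people.merge(right_people, on='Name', suffixes=('_left', '_right'))

# Merging on both columns: a row matches only if Name AND School agree
both_keys = left_people.merge(right_people, on=['Name', 'School'])

print(one_key)
print(custom)
print(both_keys)
```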

### Join

In [None]:
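# A plain join raises a ValueError: join aligns on the index, and the shared
# columns (Name, School) overlap with no suffix specified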
df1.join(df2)

In [None]:
# Join works when we just want to join on the index
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)

In [None]:
df1

In [None]:
df2

In [None]:
# Still raises a ValueError: 'School' is in both dataframes and no suffix was given
df1.join(df2)

In [None]:
# We have to define suffixes when we use join (merge did it automatically)
df1.join(df2, lsuffix='_left',rsuffix='_right')

In [None]:
# Can select different "how" with join
df1.join(df2,how='outer', lsuffix='_left',rsuffix='_right')
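One way to think about `join` is as a merge on the index. The sketch below, using two small hypothetical frames indexed by `Name`, compares `join` with `merge(..., left_index=True, right_index=True)`; note that `join` defaults to a left join while `merge` defaults to inner.

```python
import pandas as pd

# Hypothetical frames indexed by Name, with no overlapping columns
ages = pd.DataFrame({'Age': [32, 20]},
                    index=pd.Index(['Lorelai', 'Rory'], name='Name'))
homes = pd.DataFrame({'Home': ['Stars Hollow', 'Hartford']},
                     index=pd.Index(['Rory', 'Richard'], name='Name'))

# join aligns on the index and keeps every row of the left frame by default
joined = ages.join(homes)

# The same result written as a merge on both indexes
merged = ages.merge(homes, left_index=True, right_index=True, how='left')

print(joined)
print(joined.equals(merged))
```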

### Concatenate

In [None]:
# First let's reset the indices
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

In [None]:
# A standard concatenate
pd.concat([df1,df2])

In [None]:
# Explicitly state how to concatenate (keep all columns that appear in either dataset)
pd.concat([df1,df2],join='outer')

In [None]:
# Only keep common columns
pd.concat([df1,df2],join='inner')

In [None]:
# Concatenate side by side along the columns (axis=1); rows are matched by index,
# which can get very confusing if you aren't careful!
pd.concat([df1,df2],axis=1)
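The role of the index in `concat` is easier to see on a small case. In this sketch the frames `top` and `bottom` are hypothetical, and `ignore_index=True` is an optional argument that renumbers the rows after stacking.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
top = pd.DataFrame({'x': [1, 2]})                   # index 0, 1
bottom = pd.DataFrame({'x': [3, 4]}, index=[1, 2])  # index 1, 2

# Default (axis=0): stack rows; the original index labels are kept, so 1 repeats
stacked = pd.concat([top, bottom])

# ignore_index=True replaces the labels with a fresh 0..n-1 range
renumbered = pd.concat([top, bottom], ignore_index=True)

# axis=1: place the frames side by side, matching rows by index label;
# labels missing from one frame are filled with NaN
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```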

## Activity

**1.** Run the following cells to read in Cal Poly Humboldt student data. Check what would happen if you did not include the `skiprows` argument.

In [None]:
pd.read_csv('humboldt_data/Humboldt_Passing_Fa23.csv',skiprows=5)

In [None]:
passing = pd.read_csv('humboldt_data/Humboldt_Passing_Fa23.csv',skiprows=5)

In [None]:
passing.head()

In [None]:
first_gen = pd.read_csv('humboldt_data/FirstGenData_Fa23.csv',skiprows=5)

In [None]:
first_gen.head()

**2.** Merge the two dataframes. Try different `how` arguments.

**3.** Recreate the figure from the discussion question.