# Common Issues from the Final

* Using generic variable names like `df1` or `df2`, which is not acceptable
* Merging (joining) on a column or column combination that isn't unique
* Deduplicating a dataframe without cause


## Generic Variable Names

Always use variable names that tell you and the reader what the data is for.

In [None]:
import pandas as pd

sleep_data = pd.read_csv('sleep-habits.csv')
states_df = pd.read_json('us_states.json')

... two page scrolls later ...

In [None]:
sleep_by_age_year = sleep_data.pivot_table(
    index='year',
    columns='age_group',
    values='hours',
    aggfunc='mean'    
)

## Merging / Joining Inappropriately

This one can be tough to spot. The key thing to remember is that Python and pandas cannot know how your data should merge or join. If you don't understand how the join should work, pandas will likely produce the wrong result. So first work out the nature of the merge / join you are trying to do: which columns form the key, and whether that key is unique in each table.

If the data can't be joined as is, you may need to summarize it first. Aggregate each table to a level where the join doesn't duplicate information inappropriately.
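One way to check whether a key is safe to join on is to test it for duplicates before merging. The sketch below uses the same toy tables as the cells that follow. The helper `key_is_unique` is a hypothetical name introduced here for illustration, not part of pandas.

```python
import pandas as pd

# Same toy tables as the cells that follow
sleep = pd.DataFrame([
    ['male', 2019, 8.1],
    ['male', 2020, 8.2],
    ['female', 2019, 7.9],
    ['female', 2020, 7.5]
], columns=['gender', 'year', 'hours'])

happiness = pd.DataFrame([
    ['male', '<18', 4.3],
    ['male', '18-64', 2.5],
    ['male', '65+', 3.9],
    ['female', '<18', 4.1],
    ['female', '18-64', 3.1],
    ['female', '65+', 3.2]
], columns=['gender', 'age', 'happiness'])


def key_is_unique(frame, key_columns):
    """Hypothetical helper: True when no two rows share the same key values."""
    return not frame.duplicated(subset=key_columns).any()


print(key_is_unique(sleep, ['gender']))          # False
print(key_is_unique(sleep, ['gender', 'year']))  # True
print(key_is_unique(happiness, ['gender']))      # False

# Joining on a key that is unique in neither table multiplies the rows:
# each gender has 2 sleep rows and 3 happiness rows, so 2 * 3 = 6 rows per gender.
naive_merge = sleep.merge(happiness, on='gender')
print(len(sleep), len(happiness), len(naive_merge))  # 4 6 12
```

Because `gender` repeats in both tables, every sleep row is paired with every happiness row for that gender, which is the inappropriate duplication described above.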

In [20]:
import pandas as pd

sleep = pd.DataFrame([
    ['male', 2019, 8.1],
    ['male', 2020, 8.2],
    ['female', 2019, 7.9],
    ['female', 2020, 7.5]
], columns=['gender','year','hours'])

happiness = pd.DataFrame([
    ['male', '<18', 4.3],
    ['male', '18-64', 2.5],
    ['male', '65+', 3.9],
    ['female', '<18', 4.1],
    ['female', '18-64', 3.1],
    ['female', '65+', 3.2]
], columns=['gender','age','happiness'])

In [21]:
sleep_by_gender = sleep.groupby(['gender'])['hours'].sum().reset_index()
sleep_by_gender

|   | gender | hours |
|---|--------|-------|
| 0 | female | 15.4  |
| 1 | male   | 16.3  |


In [22]:
happiness_by_gender = happiness.groupby(['gender'])['happiness'].sum().reset_index()
happiness_by_gender

|   | gender | happiness |
|---|--------|-----------|
| 0 | female | 10.4      |
| 1 | male   | 10.7      |


In [23]:
sleep_by_gender.merge(happiness_by_gender)

|   | gender | hours | happiness |
|---|--------|-------|-----------|
| 0 | female | 15.4  | 10.4      |
| 1 | male   | 16.3  | 10.7      |
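If you know what kind of join you expect, you can ask pandas to enforce it. A minimal sketch, assuming the same toy data as above: the `validate` argument of `merge` raises `pandas.errors.MergeError` when the keys do not match the stated relationship. The flag `raw_merge_allowed` is just an illustrative variable.

```python
import pandas as pd
from pandas.errors import MergeError

sleep = pd.DataFrame([
    ['male', 2019, 8.1],
    ['male', 2020, 8.2],
    ['female', 2019, 7.9],
    ['female', 2020, 7.5]
], columns=['gender', 'year', 'hours'])

happiness = pd.DataFrame([
    ['male', '<18', 4.3],
    ['male', '18-64', 2.5],
    ['male', '65+', 3.9],
    ['female', '<18', 4.1],
    ['female', '18-64', 3.1],
    ['female', '65+', 3.2]
], columns=['gender', 'age', 'happiness'])

# Aggregate both tables to one row per gender, as in the cells above
sleep_by_gender = sleep.groupby('gender')['hours'].sum().reset_index()
happiness_by_gender = happiness.groupby('gender')['happiness'].sum().reset_index()

# One row per gender on each side, so a one-to-one merge is accepted
merged = sleep_by_gender.merge(
    happiness_by_gender, on='gender', validate='one_to_one'
)

# The raw tables repeat each gender, so the same check rejects the merge
try:
    sleep.merge(happiness, on='gender', validate='one_to_one')
    raw_merge_allowed = True
except MergeError as error:
    raw_merge_allowed = False
    print(f'Merge rejected: {error}')
```

Stating the expected relationship explicitly turns a silent row explosion into an error you see right away.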


## Deduplicating Without Cause

In [24]:
encounters = pd.DataFrame([
    ['m', '2023-04-01', 'flu', 1.5],
    ['f', '2023-05-03', 'pain', 1.2],
    ['f', '2023-01-03', 'headache', 2.0],
    ['m', '2023-04-01', 'flu', 1.5],
    ['m', '2023-06-13', 'flu', 2.2]
], columns=['gender','date','diagnosis','visit_hrs'])

encounters

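|   | gender | date       | diagnosis | visit_hrs |
|---|--------|------------|-----------|-----------|
| 0 | m      | 2023-04-01 | flu       | 1.5       |
| 1 | f      | 2023-05-03 | pain      | 1.2       |
| 2 | f      | 2023-01-03 | headache  | 2.0       |
| 3 | m      | 2023-04-01 | flu       | 1.5       |
| 4 | m      | 2023-06-13 | flu       | 2.2       |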


In [25]:
encounters.pivot_table(
    index='gender',
    columns='diagnosis',
    values='visit_hrs',
    aggfunc='sum'
)

| gender | flu | headache | pain |
|--------|-----|----------|------|
| f      | NaN | 2.0      | 1.2  |
| m      | 5.2 | NaN      | NaN  |
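The pivot above counts both of the identical-looking rows (0 and 3) toward the flu total for `m`. The sketch below shows what happens if they are dropped. It rests on one assumption: that the two rows are separate real visits, which the data alone cannot confirm because it has no patient or encounter identifier.

```python
import pandas as pd

encounters = pd.DataFrame([
    ['m', '2023-04-01', 'flu', 1.5],
    ['f', '2023-05-03', 'pain', 1.2],
    ['f', '2023-01-03', 'headache', 2.0],
    ['m', '2023-04-01', 'flu', 1.5],
    ['m', '2023-06-13', 'flu', 2.2]
], columns=['gender', 'date', 'diagnosis', 'visit_hrs'])

# Row 3 matches row 0 on every column, so pandas flags it as a duplicate
print(encounters.duplicated().sum())  # 1

# Total flu visit hours with every row kept
is_flu = encounters['diagnosis'] == 'flu'
flu_hours_all = float(encounters.loc[is_flu, 'visit_hrs'].sum())

# Total flu visit hours after dropping the "duplicate"
deduplicated = encounters.drop_duplicates()
is_flu_dedup = deduplicated['diagnosis'] == 'flu'
flu_hours_dedup = float(deduplicated.loc[is_flu_dedup, 'visit_hrs'].sum())

print(round(flu_hours_all, 1), round(flu_hours_dedup, 1))  # 5.2 3.7
```

If those rows are two different visits, deduplicating silently removes 1.5 hours from the total. Only deduplicate when you have a reason to believe the repeated rows are erroneous copies.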
