# Lab on Pandas
## This material was originally designed for CMSC 12100/CAPP 30121.

In [2]:
import pandas as pd
import numpy as np

### Reading the data
We will be using a sample dataset from the Current Population Survey for this assignment. The file morg_d07_strings.csv contains a modified version of the 2007 MORG data, which we downloaded from the Census Bureau’s website.

The file is in comma-separated value (CSV) format. It can be understood to represent a table with multiple rows and columns (in fact, the CSV format is supported by most spreadsheet programs, and you can try opening the file in Excel, LibreOffice Calc, etc.). The first line of the file is the header. It contains the names of the columns, separated by commas. After the header, each line in the file represents a row in the table, with each value in the row (corresponding to the columns specified in the header) separated by a comma. A common way to refer to a value in a row is as a field. So, if a CSV file has an age column, in an individual row we would refer to the age field (instead of column, which tends to refer to an entire column of values).
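As a rough illustration of the header, row, and field terminology, the sketch below parses a short excerpt with Python's built-in csv module. The three-column inline text is an assumption made to keep the example self-contained; the real file has more columns.

```python
import csv
import io

# A tiny excerpt in CSV format: one header line, then one line per row.
# (Hypothetical three-column excerpt; the real file has more columns.)
csv_text = (
    "h_id,age,gender\n"
    "1_1_1,32,Female\n"
    "1_2_2,80,Female\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
header = reader.fieldnames   # column names taken from the first line
rows = list(reader)          # each row maps a column name to a field value

# The "age" field of the first row; the csv module yields fields as strings.
first_age = rows[0]["age"]
print(header)
print(first_age)
```

Note that the csv module leaves every field as a string, whereas pd.read_csv tries to infer a numeric type for each column.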

Each row in morg_d07_strings.csv corresponds to the survey data obtained from a unique individual. We consider the following variables for each individual in this assignment (although there are a lot more variables available in the MORG datasets):

- h_id: a string that serves as a unique identifier, which we created by concatenating several variables in the original MORG datasets.
- age: an integer value specifying the age of the individual.
- gender: the gender (or sex) recorded for the individual.
- race: the race recorded for the individual.
- ethnicity: the ethnicity recorded for the individual.
- employment_status: the employment status recorded for the individual.
- hours_worked_per_week: an integer that specifies the usual weekly work hours of the individual.
- earnings_per_week: a float that indicates the weekly earnings of the individual.

The CSV file has a column for each of these variables. Here are the first few lines of the file:

In [3]:
morg_df = pd.read_csv('data/morg_d07_strings.csv')
morg_df.head(10)

| | h_id | age | gender | race | ethnicity | employment_status | hours_worked_per_week | earnings_per_week |
|---|---|---|---|---|---|---|---|---|
| 0 | 1_1_1 | 32 | Female | BlackOnly | Non-Hispanic | Working | 40.0 | 1250.0 |
| 1 | 1_2_2 | 80 | Female | WhiteOnly | Non-Hispanic | Others2 | NaN | NaN |
| 2 | 1_3_3 | 20 | Female | BlackOnly | Non-Hispanic | Others2 | NaN | NaN |
| 3 | 1_4_4 | 28 | Male | WhiteOnly | Non-Hispanic | Working | 40.0 | 1100.0 |
| 4 | 1_5_5 | 32 | Male | WhiteOnly | Non-Hispanic | Working | 52.0 | 1289.23 |
| 5 | 1_6_6 | 69 | Female | WhiteOnly | Non-Hispanic | Others1 | NaN | NaN |
| 6 | 1_7_7 | 80 | Female | WhiteOnly | Non-Hispanic | Others1 | NaN | NaN |
| 7 | 1_8_8 | 31 | Male | WhiteOnly | Non-Hispanic | Working | 45.0 | 866.25 |
| 8 | 1_9_9 | 68 | Female | WhiteOnly | Non-Hispanic | Working | 10.0 | 105.0 |
| 9 | 1_11_11 | 75 | Male | WhiteOnly | Non-Hispanic | Others1 | NaN | NaN |


#### Task 1: 
Use pd.read_csv to read the sample data into a pandas dataframe and save the result in a variable named morg_df. Use h_id, which uniquely identifies each row, as the row index.
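One possible approach, sketched under the assumption that the index_col parameter of pd.read_csv is what the task has in mind. The inline text stands in for data/morg_d07_strings.csv so the example runs on its own; in the lab you would pass the file path instead.

```python
import io
import pandas as pd

# Stand-in for data/morg_d07_strings.csv: the first three rows shown above.
csv_text = """h_id,age,gender,race,ethnicity,employment_status,hours_worked_per_week,earnings_per_week
1_1_1,32,Female,BlackOnly,Non-Hispanic,Working,40.0,1250.0
1_2_2,80,Female,WhiteOnly,Non-Hispanic,Others2,,
1_3_3,20,Female,BlackOnly,Non-Hispanic,Others2,,
"""

# index_col tells read_csv to use the h_id column as the row index
# instead of generating the default 0, 1, 2, ... index.
morg_df = pd.read_csv(io.StringIO(csv_text), index_col="h_id")
print(morg_df.index.name)
print(morg_df.columns.tolist())
```

With h_id as the index, it no longer appears among the regular columns.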

### Some simple analysis
Use .dtypes, .shape and .describe() to inspect the column types, the dimensions, and the summary statistics of morg_df.
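A minimal sketch of what these three calls return, using a four-row inline excerpt as a stand-in for the real file (an assumption made so the example is self-contained):

```python
import io
import pandas as pd

# Four-row stand-in for data/morg_d07_strings.csv.
csv_text = """h_id,age,gender,race,ethnicity,employment_status,hours_worked_per_week,earnings_per_week
1_1_1,32,Female,BlackOnly,Non-Hispanic,Working,40.0,1250.0
1_2_2,80,Female,WhiteOnly,Non-Hispanic,Others2,,
1_3_3,20,Female,BlackOnly,Non-Hispanic,Others2,,
1_4_4,28,Male,WhiteOnly,Non-Hispanic,Working,40.0,1100.0
"""
morg_df = pd.read_csv(io.StringIO(csv_text), index_col="h_id")

col_types = morg_df.dtypes      # one dtype per column
dims = morg_df.shape            # (number of rows, number of columns)
summary = morg_df.describe()    # count, mean, std, ... for numeric columns

print(col_types)
print(dims)
print(summary)
```

By default, describe() summarizes only the numeric columns, and its count row excludes missing values.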

## Getting values

#### Task 2: 
Extract the "age" column from morg_df.

#### Task 3: 
Extract the row that corresponds to h_id 1_2_2 from morg_df.

#### Task 4: 
Use slicing to extract the first four rows of morg_df.
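The three tasks above could be approached roughly as follows. This is a sketch on a reduced, five-row excerpt (an assumption for self-containment), not necessarily the intended solution.

```python
import io
import pandas as pd

# Reduced stand-in for the real file: fewer columns, first five rows.
csv_text = """h_id,age,gender,employment_status
1_1_1,32,Female,Working
1_2_2,80,Female,Others2
1_3_3,20,Female,Others2
1_4_4,28,Male,Working
1_5_5,32,Male,Working
"""
morg_df = pd.read_csv(io.StringIO(csv_text), index_col="h_id")

ages = morg_df["age"]            # a column, returned as a Series
row = morg_df.loc["1_2_2"]       # a row, looked up by its index label
first_four = morg_df.iloc[:4]    # the first four rows, by position

print(ages)
print(row)
print(first_four)
```

Here .loc selects by label and .iloc selects by position; the shorthand morg_df[:4] should give the same rows as morg_df.iloc[:4].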

#### Task 5:
Use:
- .isna()
- .isna().all()
- .isna().any(axis=1)
- .isna().any(axis=0)

Compare the results to see how they differ.
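To make the comparison concrete, here is a small sketch on a three-row excerpt (assumed for illustration) in which one row has missing hours and earnings:

```python
import io
import pandas as pd

# Three-row stand-in; the second row has two missing fields.
csv_text = """h_id,age,hours_worked_per_week,earnings_per_week
1_1_1,32,40.0,1250.0
1_2_2,80,,
1_4_4,28,40.0,1100.0
"""
morg_df = pd.read_csv(io.StringIO(csv_text), index_col="h_id")

na_mask = morg_df.isna()                    # True/False for every cell
all_missing = morg_df.isna().all()          # per column: is every value missing?
rows_with_na = morg_df.isna().any(axis=1)   # per row: is any value missing?
cols_with_na = morg_df.isna().any(axis=0)   # per column: is any value missing?

print(na_mask)
print(all_missing)
print(rows_with_na)
print(cols_with_na)
```

The axis argument controls the direction of the reduction: axis=0 collapses the rows and yields one value per column, while axis=1 collapses the columns and yields one value per row.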

#### Task 6:
Replace the NA values. Use the fillna method to replace the missing values in the columns you identified in the previous task with zero.
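A minimal sketch of fillna, assuming the columns with missing values are hours_worked_per_week and earnings_per_week (as they are in the sample rows shown earlier):

```python
import io
import pandas as pd

# Three-row stand-in; the second row has missing hours and earnings.
csv_text = """h_id,age,hours_worked_per_week,earnings_per_week
1_1_1,32,40.0,1250.0
1_2_2,80,,
1_4_4,28,40.0,1100.0
"""
morg_df = pd.read_csv(io.StringIO(csv_text), index_col="h_id")

# Assumed list of columns that contain missing values.
cols = ["hours_worked_per_week", "earnings_per_week"]

# fillna returns a new object, so assign the result back to the columns.
morg_df[cols] = morg_df[cols].fillna(0)
print(morg_df)
```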

#### Task 7:
Use filtering to extract all rows that correspond to a person who works 35 or more hours per week.

#### Task 8:
Use filtering to extract the rows that correspond to the people who are not working.

#### Task 9:
Use filtering to extract the rows that correspond to people who worked at least 35 hours per week or who earned more than $1000 per week.
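The filtering tasks above can be sketched with boolean masks. The example assumes that "not working" means an employment_status other than "Working", which is a guess based on the sample rows; the five-row excerpt is likewise a stand-in for the real file.

```python
import io
import pandas as pd

# Five-row stand-in with only the columns needed for filtering.
csv_text = """h_id,employment_status,hours_worked_per_week,earnings_per_week
1_1_1,Working,40.0,1250.0
1_2_2,Others2,,
1_4_4,Working,40.0,1100.0
1_8_8,Working,45.0,866.25
1_9_9,Working,10.0,105.0
"""
morg_df = pd.read_csv(io.StringIO(csv_text), index_col="h_id")

# A comparison on a column yields a boolean Series; indexing with it
# keeps the rows where the mask is True. Missing values compare as False.
full_time = morg_df[morg_df["hours_worked_per_week"] >= 35]

# Assumption: anyone whose status is not "Working" counts as not working.
not_working = morg_df[morg_df["employment_status"] != "Working"]

# Combine masks with | (or) and & (and); each condition needs parentheses.
either = morg_df[(morg_df["hours_worked_per_week"] >= 35)
                 | (morg_df["earnings_per_week"] > 1000)]

print(full_time)
print(not_working)
print(either)
```

In this small excerpt, everyone earning more than $1000 per week also works at least 35 hours, so the last two results happen to coincide with the first.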

#### Task 10:
Create a new DataFrame containing only the people who worked. Call it morg_df_worked.

#### Task 11:
Using morg_df_worked, create a new column with the average earnings per hour, called "avg_earnings_per_hour".
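A sketch of one way to build morg_df_worked and the hourly-earnings column, assuming "people that worked" means rows whose employment_status is "Working" and that hourly earnings are weekly earnings divided by weekly hours:

```python
import io
import pandas as pd

# Stand-in excerpt with only the columns needed here.
csv_text = """h_id,employment_status,hours_worked_per_week,earnings_per_week
1_1_1,Working,40.0,1250.0
1_2_2,Others2,,
1_4_4,Working,40.0,1100.0
1_9_9,Working,10.0,105.0
"""
morg_df = pd.read_csv(io.StringIO(csv_text), index_col="h_id")

# Assumption: "worked" means employment_status == "Working".
# .copy() gives an independent DataFrame, so adding a column later
# does not trigger a warning about modifying a view of morg_df.
morg_df_worked = morg_df[morg_df["employment_status"] == "Working"].copy()

# Arithmetic between two columns is applied element by element.
morg_df_worked["avg_earnings_per_hour"] = (
    morg_df_worked["earnings_per_week"]
    / morg_df_worked["hours_worked_per_week"]
)
print(morg_df_worked)
```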

#### Task 12:
Create a new column that has the following string: "Earnings per week are: <earnings_per_week>" using map. Call it 'string_earnings'.
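As a hedged sketch, map can take a function that is called once per value. The two-row frame below is a hypothetical stand-in for morg_df_worked.

```python
import pandas as pd

# Hypothetical two-row stand-in for morg_df_worked.
morg_df_worked = pd.DataFrame(
    {"earnings_per_week": [1250.0, 1100.0]},
    index=pd.Index(["1_1_1", "1_4_4"], name="h_id"),
)

# map calls the function on each value of the Series and collects the results.
morg_df_worked["string_earnings"] = morg_df_worked["earnings_per_week"].map(
    lambda earnings: f"Earnings per week are: {earnings}"
)
print(morg_df_worked["string_earnings"].tolist())
```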

#### Task 13:
Create a new dummy column 'female' that has a value of 1 if the person is female. Use .map with a dictionary like this:

gender_dict = {'Female': 1,
               'Male': 0}
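When map is given a dictionary, each value is looked up as a key. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical three-row stand-in with only the gender column.
morg_df = pd.DataFrame(
    {"gender": ["Female", "Female", "Male"]},
    index=pd.Index(["1_1_1", "1_2_2", "1_4_4"], name="h_id"),
)

gender_dict = {'Female': 1,
               'Male': 0}

# Each gender value is replaced by the dictionary entry for that key.
# A value that is not a key in the dictionary would become NaN.
morg_df["female"] = morg_df["gender"].map(gender_dict)
print(morg_df)
```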

#### Task 14:
Create a new column 'hard_worker' equal to 1 if the person worked more than the average hours worked. Use **apply**.
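One way this might look with apply on a single column. The sketch assumes the average is taken over hours_worked_per_week and that the column should be 0 otherwise; the five-row frame is a hypothetical stand-in for the working population.

```python
import pandas as pd

# Hypothetical stand-in: weekly hours for five people who worked.
morg_df_worked = pd.DataFrame(
    {"hours_worked_per_week": [40.0, 40.0, 52.0, 45.0, 10.0]},
    index=pd.Index(["1_1_1", "1_4_4", "1_5_5", "1_8_8", "1_9_9"], name="h_id"),
)

# Compute the average once, outside the function passed to apply.
avg_hours = morg_df_worked["hours_worked_per_week"].mean()

# On a Series, apply calls the function once per value.
morg_df_worked["hard_worker"] = morg_df_worked["hours_worked_per_week"].apply(
    lambda hours: 1 if hours > avg_hours else 0
)
print(avg_hours)
print(morg_df_worked)
```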


#### Task 15:
Use **apply** to calculate the max of earnings_per_week and hours_worked_per_week.
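A sketch under the assumption that the task asks for the maximum of each column. On a DataFrame, apply passes each column to the function as a Series. The four-row frame is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical four-row stand-in with the two numeric columns.
morg_df_worked = pd.DataFrame(
    {
        "earnings_per_week": [1250.0, 1100.0, 1289.23, 866.25],
        "hours_worked_per_week": [40.0, 40.0, 52.0, 45.0],
    },
    index=pd.Index(["1_1_1", "1_4_4", "1_5_5", "1_8_8"], name="h_id"),
)

cols = ["earnings_per_week", "hours_worked_per_week"]

# With the default axis=0, the function receives one column at a time,
# so the result holds one maximum per column.
col_max = morg_df_worked[cols].apply(lambda col: col.max())
print(col_max)
```

Passing axis=1 instead would hand each row to the function and yield one value per person.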

#### Task 16:
Create a function that takes a string and reverses it. Use apply to create a new column 'reversed' with employment_status reversed.
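A possible sketch, using slice notation with a step of -1 to reverse the string. The function name reverse_string and the two-row frame are illustrative choices, not part of the lab.

```python
import pandas as pd

def reverse_string(text):
    """Return text with its characters in reverse order."""
    # A slice with step -1 walks the string from the end to the start.
    return text[::-1]

# Hypothetical two-row stand-in with only the column being reversed.
morg_df = pd.DataFrame(
    {"employment_status": ["Working", "Others2"]},
    index=pd.Index(["1_1_1", "1_2_2"], name="h_id"),
)

# apply calls reverse_string once per value in the column.
morg_df["reversed"] = morg_df["employment_status"].apply(reverse_string)
print(morg_df)
```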

## Merging

In [None]:
students = pd.read_csv("data/students.csv")
grades = pd.read_csv("data/grades.csv")

In [None]:
students

In [None]:
grades

In [None]:
pd.merge(students, grades, on="UCID", how="inner")

Notice that the columns from the first argument (students) are followed by the corresponding columns from the second argument (grades), minus UCID, and that each row contains information for a single UCID. Sherlock Holmes does not appear in the result because there is no row with his UCID (2222) in the grades dataframe. Likewise, UCID 9999, which appears in the grades dataframe, does not appear in the result because it has no match in the students dataframe.
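The behaviour described above can be checked on small in-memory frames. The data below is reconstructed from the merge output shown in this lab rather than read from data/students.csv and data/grades.csv, so the row order of the real files is an assumption.

```python
import pandas as pd

# Reconstructed stand-ins for the two CSV files.
students = pd.DataFrame({
    "First Name": ["Sam", "Nancy", "Sherlock", "V.I."],
    "Last Name": ["Spade", "Drew", "Holmes", "Warshawski"],
    "UCID": [1234, 2789, 2222, 7654],
    "Email": ["spade@uchicago.edu", "ndrew@uchicago.edu",
              "bakerstreet@uchicago.edu", "viw@uchicago.edu"],
    "Major": ["Sociology", "Mathematics", "Psychology", "Mathematics"],
})
grades = pd.DataFrame({
    "UCID": [1234, 2789, 7654, 9999],
    "Course": ["CS 121", "CS 121", "CS 121", "CS 121"],
    "Score": [65, 90, 85, 100],
    "Grade": ["C", "A-", "B+", "A+"],
})

# An inner merge keeps only the UCIDs present in both dataframes.
inner = pd.merge(students, grades, on="UCID", how="inner")
print(inner)
```

UCID appears once in the result, in the position it has in students.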

#### Task 17:
Include Sherlock Holmes, but don't include UCID 9999.

In [None]:
pd.merge(students, grades, on="UCID", how="left")

#### Task 18:
Include UCID 9999, but don't include Sherlock Holmes.

In [None]:
pd.merge(students, grades, on="UCID", how="right")

#### Task 19:
Include all rows from both dataframes.

In [169]:
pd.merge(students, grades, on="UCID", how="outer")

| | First Name | Last Name | UCID | Email | Major | Course | Score | Grade |
|---|---|---|---|---|---|---|---|---|
| 0 | Sam | Spade | 1234 | spade@uchicago.edu | Sociology | CS 121 | 65.0 | C |
| 1 | Nancy | Drew | 2789 | ndrew@uchicago.edu | Mathematics | CS 121 | 90.0 | A- |
| 2 | Sherlock | Holmes | 2222 | bakerstreet@uchicago.edu | Psychology | NaN | NaN | NaN |
| 3 | V.I. | Warshawski | 7654 | viw@uchicago.edu | Mathematics | CS 121 | 85.0 | B+ |
| 4 | NaN | NaN | 9999 | NaN | NaN | CS 121 | 100.0 | A+ |
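To compare the four values of how side by side, the sketch below merges reduced versions of the two dataframes (fewer columns, reconstructed from the output above as an assumption) and records which UCIDs survive each merge. It also uses the indicator parameter of pd.merge, which adds a _merge column saying where each row came from.

```python
import pandas as pd

# Reduced, reconstructed stand-ins for the students and grades dataframes.
students = pd.DataFrame({
    "Last Name": ["Spade", "Drew", "Holmes", "Warshawski"],
    "UCID": [1234, 2789, 2222, 7654],
})
grades = pd.DataFrame({
    "UCID": [1234, 2789, 7654, 9999],
    "Grade": ["C", "A-", "B+", "A+"],
})

# Collect the UCIDs kept by each kind of merge.
ucids_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(students, grades, on="UCID", how=how)
    ucids_by_how[how] = sorted(merged["UCID"].tolist())
    print(how, ucids_by_how[how])

# indicator=True adds a _merge column with the values
# "left_only", "right_only", or "both" for each row.
outer = pd.merge(students, grades, on="UCID", how="outer", indicator=True)
source_by_ucid = dict(zip(outer["UCID"].tolist(), outer["_merge"].tolist()))
print(source_by_ucid)
```

The row order of an outer merge can differ between pandas versions, which is why the sketch compares sorted UCIDs rather than positions.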
