# DATA 202 - Module 4: Transforming Data with Pandas
* Instructor: Dr. Josh Fagan
* [Jupyter Notebook Tips and Tricks](http://bit.ly/34embJh)
* [Markdown Cheatsheet](http://bit.ly/2UkNVXV)
* Magic command to list all variables: `%whos`

### Instructions

Welcome to the Module 4 assignment of DATA 202. This assignment is meant to help you review/familiarize yourself with transforming datasets in Pandas.

To receive credit for an assignment, answer all questions correctly and submit before the deadline listed on Canvas.

---
### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** below.

**Collaborators**: *list collaborators here*
* Joseph Beller
* Jessi Hudgins  


---
## Exercises


In [1]:
# To answer all of the exercises you will need to import the pandas package. 
# Please do that below, in this code block.
import pandas as pd

### Exercise 0 - Loading and Basic Exploration

In this assignment, we will use the `business_flights.csv` dataset found on the Canvas site. 

- Load the data into a `DataFrame` called `flights`.
- Include appropriate arguments to have `date` stored as a `datetime`. (This will likely produce a bunch of red error messages; just ignore them.)
- Display the first 5 rows of `flights`.

In [2]:
flights = pd.read_csv('/Users/carolinelpetersen/Desktop/DATA202/business_flights.csv')
flights.head()

,date,airline,ch_code,num_code,dep_time,from,time_taken,stop,arr_time,to,price
0,11-02-2022,Air India,AI,868,18:00,Delhi,02h 00m,non-stop,20:00,Mumbai,25612
1,11-02-2022,Air India,AI,624,19:00,Delhi,02h 15m,non-stop,21:15,Mumbai,25612
2,11-02-2022,Air India,AI,531,20:00,Delhi,24h 45m,1-stop,20:45,Mumbai,42220
3,11-02-2022,Air India,AI,839,21:25,Delhi,26h 30m,1-stop,23:55,Mumbai,44450
4,11-02-2022,Air India,AI,544,17:15,Delhi,06h 40m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,46690


### Exercise 0 Grading Notes

Deductions:
- You were asked to "Include appropriate arguments to have `date` stored as a `datetime`." (-3)
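
For reference, one way the `date` requirement might be met is the `parse_dates` argument of `pd.read_csv`. The sketch below reads a small inline sample rather than the real file. It assumes the dates are written day-first (`dd-mm-yyyy`), which the rows shown above do not confirm, and the names `sample_csv` and `demo` are illustrative only.

```python
import io
import pandas as pd

# Tiny stand-in for business_flights.csv (illustrative rows only).
sample_csv = io.StringIO('''date,airline,price
11-02-2022,Air India,"25,612"
25-02-2022,Vistara,"54,608"
''')

# parse_dates asks pandas to convert the named column while reading.
# dayfirst=True is an ASSUMPTION that 11-02-2022 means 11 February 2022.
demo = pd.read_csv(sample_csv, parse_dates=["date"], dayfirst=True)

print(demo.dtypes)
```

With the real file, the same two arguments would be added to the existing `pd.read_csv(...)` call.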


Exercise 0 Grade:

2/5

### Exercise 1 - Make Prices More Useful

Run the `.describe()` function on the `price` column of `flights`.

In [3]:
flights.price.describe()

count      93487
unique      2358
top       54,608
freq        1445
Name: price, dtype: object

There is not much helpful information there. Let's find out why!

Display the data type of each column in `flights`.

In [4]:
flights.dtypes

date          object
airline       object
ch_code       object
num_code       int64
dep_time      object
from          object
time_taken    object
stop          object
arr_time      object
to            object
price         object
dtype: object

Why does our earlier command not give us min, max, and percentile values?

In [5]:
# The price column is stored as a string so it isn't viewed as numerical data for describe() to 
# run statistical analysis on
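
To illustrate the answer above, here is a minimal sketch comparing `describe()` on the same prices stored as text and as numbers. The two small Series are made-up samples, not rows read from the dataset.

```python
import pandas as pd

# The same three prices, once as text and once as integers.
as_text = pd.Series(["25,612", "42,220", "25,612"])
as_numbers = pd.Series([25612, 42220, 25612])

text_summary = as_text.describe()
number_summary = as_numbers.describe()

# Text columns get count/unique/top/freq; numeric columns get
# count/mean/std/min/quartiles/max.
print(text_summary.index.tolist())
print(number_summary.index.tolist())
```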

Try to cast the column as an `int` using the `astype` command.

What happens when you try to cast the `price` column as an `int`?

In [6]:
flights.price.astype(int)

ValueError: invalid literal for int() with base 10: '25,612'

In [7]:
# You get an error because the commas in the price column can't be converted to an integer

Python is not happy about us trying to cast to an `int` because there are commas in the prices. To handle this, we will skip pandas' built-in specialty functions, get creative, and write our own casting method.

Use the `map` function with an inline `lambda` function to do two things:
1. Remove the commas
2. Change each value in the `price` column to an integer. 

Hint: `my_string.replace(',', '')`

In [8]:
flights.price = flights.price.map(lambda d: int(d.replace(',', '')))

Display the type for the `price` column again.

In [9]:
flights.price.dtypes

dtype('int64')
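
As a side note, the `map` plus `lambda` approach can be compared with pandas' vectorized string methods, which are not what the exercise asks for. The sketch below runs both on a small made-up Series and should give the same integers either way.

```python
import pandas as pd

prices = pd.Series(["25,612", "42,220", "54,608"])

# Approach from the exercise: a Python-level lambda applied to each value.
mapped = prices.map(lambda d: int(d.replace(",", "")))

# Alternative (not required by the exercise): vectorized string methods.
# regex=False treats the comma as a literal character.
vectorized = prices.str.replace(",", "", regex=False).astype(int)

print(mapped.tolist())
print(vectorized.tolist())
```

`pd.read_csv` also accepts a `thousands=','` argument, which may avoid the problem at load time, though that has not been checked against this file.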

Again, use the describe function to display valuable statistics about the `price` column of `flights`.

In [10]:
flights.price.describe()

count     93487.000000
mean      52540.081124
std       12969.314606
min       12000.000000
25%       45185.000000
50%       53164.000000
75%       60396.000000
max      123071.000000
Name: price, dtype: float64

### Exercise 1 Grading Notes

Exercise 1 Grade:

25/25

### Exercise 2 - Clean Up Number of Stops

Display the unique values and their respective number of occurrences of the `stop` column in the `flights` DataFrame.  

In [11]:
flights.stop.value_counts()

1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t                   81487
non-stop                                                                      8102
2+-stop                                                                       1083
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IDR\n\t\t\t\t\t\t\t\t\t\t\t\t              810
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IXU\n\t\t\t\t\t\t\t\t\t\t\t\t              776
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia PAT\n\t\t\t\t\t\t\t\t\t\t\t\t              257
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Patna\n\t\t\t\t\t\t\t\t\t\t\t\t            242
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia BBI\n\t\t\t\t\t\t\t\t\t\t\t\t              152
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia STV\n\t\t\t\t\t\t\t\t\t\t\t\t               93
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IXE\n\t\t\t\t\t\t\t\t\t\t\t\t               86
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Bhubaneswar\n\t\t\t\t\t\t\t\t\t\t\t\t       75
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Hyderabad\n\t\t\t\t\t\t\t\t\t\t\t\t         71
1-st

How many unique values do you think the `stop` column could contain?

In [12]:
# 25

Use the `map` function to update all of the values in `stop` to be 'zero', 'one', or 'two_or_more'. 

Make sure everything looks good by displaying the unique values and their respective number of occurrences of the `stop` column in the `flights` DataFrame.  

In [13]:
def stop_map(stop):
    if "1-stop" in stop:
        return "one"
    elif stop == "2+-stop":
        return "two or more"
    else:
        return "zero"

flights.stop = flights.stop.map(stop_map)

flights.stop.value_counts()

one            84302
zero            8102
two or more     1083
Name: stop, dtype: int64
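
A dictionary-based variant of the same cleanup is sketched below, assuming every raw value begins with `non-stop`, `1-stop`, or `2+-stop` followed by an optional newline and extra text. The sample values are shortened stand-ins for the real ones, and the labels follow the spelling in the exercise prompt (`two_or_more`).

```python
import pandas as pd

raw = pd.Series([
    "non-stop",
    "1-stop\n\t\t\tVia IDR\n\t\t\t",
    "1-stop\n\t\t\t\n\t\t\t",
    "2+-stop",
])

# Keep only the text before the first newline, e.g. "1-stop".
prefix = raw.str.split("\n").str[0].str.strip()

# Passing a dict to map translates each key to its label.
labels = {"non-stop": "zero", "1-stop": "one", "2+-stop": "two_or_more"}
clean = prefix.map(labels)

print(clean.value_counts())
```

Unlike a catch-all `else` branch, a dict passed to `map` turns any unexpected value into `NaN`, so surprises in the data show up instead of being silently counted as "zero".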

### Exercise 2 Grading Notes

Exercise 2 Grade:

25/25

### Exercise 3 - Make Flight Code
The `flights` dataset has a column for Flight Character Code, `ch_code`, and a column for Flight Numerical Code, `num_code`. We want to have one column that combines those two pieces of information into one Flight Code, which we will store in a column called `flight_code`. For example, if a record has a `ch_code` of 'UK' and a `num_code` of '985', we want `flight_code` to store 'UK-985'.

Use the `apply` function to create and add the desired column to the `flights` DataFrame. 

- Take a look at the first 5 rows of flights now and make sure the column exists and looks good.
- Leave the `%%time` statement at the start of the cell to see how long the execution took.

In [14]:
%%time
def flight_code(row): 
    return f"{row['ch_code']} - {row['num_code']}"

flights['flight_code'] = flights.apply(flight_code, axis=1)

print(flights.head())

         date    airline ch_code  num_code dep_time   from time_taken  stop  \
0  11-02-2022  Air India      AI       868    18:00  Delhi    02h 00m  zero   
1  11-02-2022  Air India      AI       624    19:00  Delhi    02h 15m  zero   
2  11-02-2022  Air India      AI       531    20:00  Delhi    24h 45m   one   
3  11-02-2022  Air India      AI       839    21:25  Delhi    26h 30m   one   
4  11-02-2022  Air India      AI       544    17:15  Delhi    06h 40m   one   

  arr_time      to  price flight_code  
0    20:00  Mumbai  25612    AI - 868  
1    21:15  Mumbai  25612    AI - 624  
2    20:45  Mumbai  42220    AI - 531  
3    23:55  Mumbai  44450    AI - 839  
4    23:55  Mumbai  46690    AI - 544  
CPU times: user 235 ms, sys: 16 ms, total: 251 ms
Wall time: 248 ms


In [15]:
%%time
flights.flight_code = flights.ch_code.astype(str) + '-' + flights.num_code.astype(str)

print(flights.head())

         date    airline ch_code  num_code dep_time   from time_taken  stop  \
0  11-02-2022  Air India      AI       868    18:00  Delhi    02h 00m  zero   
1  11-02-2022  Air India      AI       624    19:00  Delhi    02h 15m  zero   
2  11-02-2022  Air India      AI       531    20:00  Delhi    24h 45m   one   
3  11-02-2022  Air India      AI       839    21:25  Delhi    26h 30m   one   
4  11-02-2022  Air India      AI       544    17:15  Delhi    06h 40m   one   

  arr_time      to  price flight_code  
0    20:00  Mumbai  25612      AI-868  
1    21:15  Mumbai  25612      AI-624  
2    20:45  Mumbai  42220      AI-531  
3    23:55  Mumbai  44450      AI-839  
4    23:55  Mumbai  46690      AI-544  
CPU times: user 41.7 ms, sys: 5.56 ms, total: 47.3 ms
Wall time: 44.8 ms
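
The two timings above reflect a row-by-row `apply` versus whole-column string concatenation. Outside a notebook, the same comparison can be sketched with `time.perf_counter`. The DataFrame below is synthetic, and both versions use the `'-'` separator with no spaces, as in the 'UK-985' example from the prompt.

```python
import time
import pandas as pd

# Synthetic data: 30,000 rows built by repeating three flights.
demo = pd.DataFrame({
    "ch_code": ["AI", "UK", "AI"] * 10_000,
    "num_code": [868, 985, 624] * 10_000,
})

# Row-wise: the lambda is called once per row.
start = time.perf_counter()
row_wise = demo.apply(lambda row: f"{row['ch_code']}-{row['num_code']}", axis=1)
apply_seconds = time.perf_counter() - start

# Column-wise: each operation works on a whole column at once.
start = time.perf_counter()
column_wise = demo["ch_code"].astype(str) + "-" + demo["num_code"].astype(str)
vector_seconds = time.perf_counter() - start

print(f"apply: {apply_seconds:.3f}s, vectorized: {vector_seconds:.3f}s")
```

Exact times depend on the machine, so only the relative difference is meaningful.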


### Exercise 3 Grading Notes

Exercise 3 Grade:

25/25

### Exercise 4 - Explore Groupby
What is the minimum, maximum, and average cost of flights by airline? Sort your answer in descending order by mean value.

In [16]:
import numpy as np # This gives you access to np.mean
flights.groupby('airline').price.agg(['min', 'max', np.mean]).sort_values(by='mean', ascending=False)

airline,min,max,mean
Vistara,17604,123071,55477.027777
Air India,12000,90970,47131.039212
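
The same aggregation can also be written with string names only, as in the sketch below on a made-up four-row DataFrame. Recent pandas versions may emit a `FutureWarning` when `np.mean` is passed to `agg`, so the string `'mean'` is used here instead; the prices are invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "airline": ["Vistara", "Air India", "Vistara", "Air India"],
    "price": [100, 50, 300, 150],
})

# Group by airline, compute three statistics, then sort by the mean.
summary = (
    demo.groupby("airline")["price"]
    .agg(["min", "max", "mean"])
    .sort_values(by="mean", ascending=False)
)

print(summary)
```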


### Exercise 4 Grading Notes

Exercise 4 Grade:

20/20

---
## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. I recommend going to the "Kernel" menu at the top and selecting "Restart & Run All". This will ensure that everything runs correctly when it is run sequentially. 

## Final Grade
97/100