<a href="https://colab.research.google.com/github/akaashpatel10/Complete-Python-3-Bootcamp/blob/master/Week6_data_prep_visualization_homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Prep and Visualization in Python

In this project, we'll work through munging a dataset and creating visualizations of trends in the airline industry in the middle of the last century. You'll get started with [Matplotlib](https://matplotlib.org/), a powerful and popular Python plotting library that is covered in this week's course materials.

In [None]:
# Install the pydataset package. This package gives us data sets to work with very easily
! pip install pydataset



In [None]:
# The conventional alias for matplotlib.pyplot is "plt". We'll also need pandas and numpy

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## The Air Passengers Dataset

This dataset records monthly totals of international airline passengers, in thousands, from 1949 to 1960 (the classic Box and Jenkins airline series). Your job is to clean it up with various data munging operations and prepare it for several visualizations. You will then work out the code needed to generate those visualizations.

In [None]:
from pydataset import data

passengers = data('AirPassengers')

Ugh. When we examine the head of this dataset, we can see that the dates are stored as decimal years rather than as separate month and year values. We'll need to change that before we can do our analysis.

NOTE: The times are represented in twelfths of a year, e.g. 1949.00000 = 1949 0/12 (January), 1949.083333 = 1949 1/12 (February), and so on.
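
The arithmetic behind that note can be sketched in plain Python. The helper name `decode_decimal_year` is made up for illustration, and the rounding step is an assumption that the stored fractions are only approximately equal to whole twelfths (as the printed value 1949.083333 suggests).

```python
def decode_decimal_year(t):
    """Split a decimal year into (year, month), with January = 1."""
    year = int(t)                      # whole part is the calendar year
    twelfths = round((t - year) * 12)  # 0 for January ... 11 for December
    return year, twelfths + 1


for t in (1949.0, 1949.083333, 1949.25, 1949.75):
    print(t, decode_decimal_year(t))
# 1949.0 (1949, 1)
# 1949.083333 (1949, 2)
# 1949.25 (1949, 4)
# 1949.75 (1949, 10)
```

Note that a fraction of 3/12 (0.25) is April, not March, because the count starts at zero for January.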

In [None]:
passengers.head(12)

Unnamed: 0,time,AirPassengers
1,1949.0,112
2,1949.083333,118
3,1949.166667,132
4,1949.25,129
5,1949.333333,121
6,1949.416667,135
7,1949.5,148
8,1949.583333,148
9,1949.666667,136
10,1949.75,119


## The decimal years complicate the EDA work

We need to deal with this by making explicit month and year columns. Reformatting columns like this is a common step when preparing a DataFrame.

## #1 Add a 'year' column to passengers that holds the calendar year of each row

In [None]:
# TODO
# Truncating the decimal year to an integer leaves just the calendar year
passengers['year'] = passengers['time'].astype(int)

passengers.head(12)

Unnamed: 0,time,AirPassengers,year
1,1949.0,112,1949
2,1949.083333,118,1949
3,1949.166667,132,1949
4,1949.25,129,1949
5,1949.333333,121,1949
6,1949.416667,135,1949
7,1949.5,148,1949
8,1949.583333,148,1949
9,1949.666667,136,1949
10,1949.75,119,1949


## #2 Add a "month" column

Set this up in such a way that January is represented with a 1, February with a 2, etc.

*Hint: create a new data frame with the months and their decimal equivalents, and then use a join.*
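
One thing to watch when joining on decimal equivalents: a join matches keys exactly, and fractions such as 0.083333 are only approximations of 1/12. The sketch below uses four `time` values from the `head()` output and a hypothetical lookup table (the column names `float_key` and `int_key` are made up for illustration) to compare a join on the raw float with a join on a rounded integer.

```python
import pandas as pd

# Four 'time' values copied from the head() output; the fractions are twelfths of a year.
left = pd.DataFrame({"time": [1949.0, 1949.083333, 1949.166667, 1949.25]})
left["float_key"] = (left["time"] - left["time"].astype(int)) * 12
left["int_key"] = left["float_key"].round().astype(int)

# Hypothetical lookup table holding both kinds of key for the same four months.
lookup = pd.DataFrame({
    "float_key": [0.0, 1.0, 2.0, 3.0],
    "int_key": [0, 1, 2, 3],
    "month": ["January", "February", "March", "April"],
})

# The default merge is an inner join, so rows without an exact key match are dropped.
float_join = left.merge(lookup[["float_key", "month"]], on="float_key")
int_join = left.merge(lookup[["int_key", "month"]], on="int_key")

print(len(float_join), "rows matched on the float key")
print(len(int_join), "rows matched on the integer key")
```

The float join silently loses every row whose fraction is not exact, while the integer join keeps all four. Rounding before the join is the safer choice here.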

In [None]:
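# TODO

# The fraction of 'time' counts twelfths of a year (0/12 = January), so round away
# floating-point noise and add 1 to get a month number from 1 to 12
passengers['month_number'] = ((passengers['time'] - passengers['year']) * 12).round().astype(int) + 1

# Lookup table from month number to month name
# (named month_data so it does not shadow the data() function imported from pydataset)
month_data = {'month_number': list(range(1, 13)), 'month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']}

month_mapping = pd.DataFrame.from_dict(month_data)

month_mapping.head(12)

# A left join on the integer key keeps every row of passengers in its original order
passengers_clean = passengers.merge(month_mapping, on='month_number', how='left')

passengers_clean.head(12)

Unnamed: 0,time,AirPassengers,year,month_number,month
0,1949.0,112,1949,1,January
1,1949.083333,118,1949,2,February
2,1949.166667,132,1949,3,March
3,1949.25,129,1949,4,April
4,1949.333333,121,1949,5,May
5,1949.416667,135,1949,6,June
6,1949.5,148,1949,7,July
7,1949.583333,148,1949,8,August
8,1949.666667,136,1949,9,September
9,1949.75,119,1949,10,October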


## #3 Generate the plot below of passengers vs. time using each monthly count

<a href='https://drive.google.com/file/d/1PdaXbkCVzUXBnUP6c6cLP3nZ94ShSLg1/view?usp=embed_facebook&source=ctrlq.org'><img src='https://lh4.googleusercontent.com/7EHckqyjefS7rN8-gAtj2SgSyKfV3wlTnGKqCwzOf85F6NYlqYQbz7bDfWw=w2400' /></a>

In [None]:
# TODO
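
One possible approach, as a minimal sketch: plot `AirPassengers` against the decimal `time` column with `Axes.plot`. The function name `plot_monthly`, the figure size, and the label and title text are assumptions, since the reference image does not specify them here. The demo frame holds only the ten rows printed by `passengers.head(12)` earlier; in the notebook you would pass the full `passengers` frame.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_monthly(df):
    """Line plot of the monthly passenger count against decimal-year time."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(df["time"], df["AirPassengers"])
    ax.set_xlabel("Year")                   # label text is a guess
    ax.set_ylabel("Passengers")
    ax.set_title("Monthly air passengers")  # title text is a guess
    return ax


# Stand-in frame: the ten rows shown in the head() output above.
sample = pd.DataFrame({
    "time": [1949.0, 1949.083333, 1949.166667, 1949.25, 1949.333333,
             1949.416667, 1949.5, 1949.583333, 1949.666667, 1949.75],
    "AirPassengers": [112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
})
ax = plot_monthly(sample)
```

Because `time` is already numeric and evenly spaced, it can go straight on the x-axis without any date conversion.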

## #4 Generate the plot below of passengers vs. time using an annual count

<a href='https://drive.google.com/file/d/19WYHQR7sFgaeN5ZHlwx5x1-o-wxJ4weW/view?usp=sharing&amp;usp=embed_facebook&source=ctrlq.org'><img src='https://lh4.googleusercontent.com/2gbHNgm8UhbCEevaUBpMUSvVgk_6QuxMASqn9-wK1NdzrDXrcF-VIWK_o08=w2400' /></a>

In [None]:
# TODO
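
A hedged sketch for the annual view: group by the `year` column from #1, sum the monthly counts, and plot the totals. Reading "annual count" as a sum (rather than a mean) is an assumption, as are the function name `plot_annual` and the label text.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_annual(df: pd.DataFrame):
    """Line plot of total passengers per year.

    Assumes df has the 'year' column added in #1 and an 'AirPassengers' column.
    """
    totals = df.groupby("year")["AirPassengers"].sum()  # one total per calendar year
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(totals.index.to_numpy(), totals.to_numpy())
    ax.set_xlabel("Year")
    ax.set_ylabel("Passengers per year")    # label text is a guess
    ax.set_title("Annual air passengers")   # title text is a guess
    return ax
```

In the notebook, the call would be `plot_annual(passengers)` once the `year` column exists.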

## #5 Generate the barplot below of passengers by year

<a href='https://drive.google.com/file/d/1-4NF40zvVhwi6RWagJu98BaBuDNOXaEd/view?usp=sharing&amp;usp=embed_facebook&source=ctrlq.org'><img src='https://lh6.googleusercontent.com/IQRk35KApDIxYtHGH3WoczLnCvHCRdMNlHw64rgLWPYUostOoAn2hxp8lZA=w2400' /></a>

In [None]:
# TODO
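
The same yearly totals can be drawn as bars with `Axes.bar`. This is a sketch under the same assumptions as before: the `year` column from #1 exists, totals are sums, and the name `plot_annual_bars` and the label text are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_annual_bars(df: pd.DataFrame):
    """Bar plot of total passengers per year (assumes a 'year' column)."""
    totals = df.groupby("year")["AirPassengers"].sum()
    years = totals.index.to_numpy()
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(years, totals.to_numpy())
    ax.set_xticks(years)                    # one tick per year so every bar is labeled
    ax.set_xlabel("Year")
    ax.set_ylabel("Passengers per year")    # label text is a guess
    return ax
```

Setting the ticks explicitly avoids Matplotlib choosing fractional tick positions between integer years.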

## #6 Generate the histogram below of monthly passengers

**Additional requirements:**

* Only include 1955 and beyond
* Use a binwidth of 50, a min of 200, and a max of 700
* Set the yticks to start at 0, end at 25 by interval of 5

<a href='https://drive.google.com/file/d/1mEtvUbnh2LcDDc73LNr_qX984HzgyhiQ/view?usp=sharing&amp;usp=embed_facebook&source=ctrlq.org'><img src='https://lh6.googleusercontent.com/7I2FzRPSQPyoalFcwH3vTDeB9Gf80OUlaZOs1x9oRRYyQLlHXPU9H-NhSVQ=w2400' /></a>

In [None]:
# TODO
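
A minimal sketch of the three requirements above: filter on the `year` column, build the bin edges with `np.arange` (the stop value is exclusive, so 701 is used to include 700), and set the y ticks explicitly. The function name `plot_hist_from_1955` and the axis labels are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_hist_from_1955(df: pd.DataFrame):
    """Histogram of monthly counts for 1955 onward (assumes a 'year' column)."""
    recent = df.loc[df["year"] >= 1955, "AirPassengers"]
    bins = np.arange(200, 701, 50)       # edges 200, 250, ..., 700 -> ten bins of width 50
    fig, ax = plt.subplots()
    ax.hist(recent, bins=bins)
    ax.set_yticks(np.arange(0, 26, 5))   # 0, 5, ..., 25
    ax.set_xlabel("Monthly passengers")  # label text is a guess
    ax.set_ylabel("Number of months")
    return ax
```

Passing explicit edges to `bins` fixes the bin width, minimum, and maximum in one argument.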

## #7 Generate the histogram below of monthly passengers

**Additional requirements:**

* Generate two groups to compare. Group 1 should be the years 1949-1950. Group 2 should be the years 1959-1960.
* Binwidth of 50 from 100 to 700
* yticks from 0 to 24, spaced by 2
* Be sure to include a legend

<a href='https://drive.google.com/file/d/1gqJbBVOPIurYikUIDpXoAF3gZx2p8lUA/view?usp=sharing&amp;usp=embed_facebook&source=ctrlq.org'><img src='https://lh3.googleusercontent.com/Ok91nFY8Srjn1FpVwOil9ycH9y6isZejTqi7hifqaEA5E3tWpkwldWVLo3U=w2400' /></a>

In [None]:
# TODO
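
One way to sketch the comparison is to call `Axes.hist` twice on the same axes with shared bin edges, giving each call a `label` so `legend()` can pick it up. Overlaying the two groups with partial transparency is an assumption; the reference image may place the bars side by side instead. The function name `plot_hist_two_periods` is also made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_hist_two_periods(df: pd.DataFrame):
    """Overlaid histograms for 1949-1950 and 1959-1960 (assumes a 'year' column)."""
    early = df.loc[df["year"].between(1949, 1950), "AirPassengers"]  # between() is inclusive
    late = df.loc[df["year"].between(1959, 1960), "AirPassengers"]
    bins = np.arange(100, 701, 50)       # edges 100, 150, ..., 700
    fig, ax = plt.subplots()
    ax.hist(early, bins=bins, alpha=0.6, label="1949-1950")
    ax.hist(late, bins=bins, alpha=0.6, label="1959-1960")
    ax.set_yticks(np.arange(0, 25, 2))   # 0, 2, ..., 24
    ax.set_xlabel("Monthly passengers")  # label text is a guess
    ax.legend()
    return ax
```

Sharing one `bins` array keeps the two distributions directly comparable bar for bar.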

## #8 Generate the time plot below

**Additional requirements:**

* Compare 1950, 1955, and 1960 by month

<a href='https://drive.google.com/file/d/11nVH5EiYxxtJ48isS9VLtwLIjn0hALXV/view?usp=sharing&amp;usp=embed_facebook&source=ctrlq.org'><img src='https://lh3.googleusercontent.com/SKfWqBE324A__VS8V-TBqMQXHWE9OUjVoJyeyJME8uJzyfWS73aaCms7A3c=w2400' /></a>

In [None]:
# TODO
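
A hedged sketch for the by-month comparison: draw one line per year with the month number (1 to 12) on the x-axis. To stay self-contained, the function derives year and month directly from the decimal `time` column, rounding the twelfths to absorb floating-point noise. The name `plot_years_by_month`, the default `years` tuple, and the label text are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_years_by_month(df: pd.DataFrame, years=(1950, 1955, 1960)):
    """One line per selected year, passengers against month number (January = 1)."""
    year = df["time"].astype(int)
    month = ((df["time"] - year) * 12).round().astype(int) + 1
    fig, ax = plt.subplots(figsize=(10, 5))
    for y in years:
        mask = year == y
        ax.plot(month[mask], df.loc[mask, "AirPassengers"], label=str(y))
    ax.set_xticks(range(1, 13))
    ax.set_xlabel("Month")       # label text is a guess
    ax.set_ylabel("Passengers")
    ax.legend(title="Year")
    return ax
```

Plotting all three years on a shared month axis makes the seasonal shape and the year-over-year growth visible in the same figure.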

## #9 Understand your data and tell a story

* Which of these plots would you create first to explore your data before building a model or performing an analysis? Why?
* If you could only use one of these plots to tell a story about mid-century air travel trends, which would you use and why? What are some insights you could share?


In [None]:
# TODO