# Data Science for Beginners - Part 2
Now that you have your environment running, we can start working with real data. We'll use this dataset:

[*2015 Flight Delays and Cancellations* provided by the U.S. Department of Transportation](https://www.kaggle.com/usdot/flight-delays). 

It is best to follow along in a running Jupyter environment, either local or on Azure Notebooks. It also helps to run this from a cloned repo or library so that the data is accessible too.

**Note:** if you are on Azure Notebooks, you will probably need to use a subset of the data because `flights.csv` may be too large for Azure to handle. In *Section 2*, run the cell under *FOR AZURE NOTEBOOKS* instead of the one that loads the full file.

# 1. Import Modules and Data

In [2]:
%matplotlib inline

import pandas as pd # Panel data processing
import numpy as np # Library for numerical data
import matplotlib.pyplot as plt # Plots
import matplotlib 

# 2. Read Data from File

**We host this data on an S3 bucket, so it can take a while to download. You can choose to just download the file yourself and change the path.**

First we load our CSV with `pd.read_csv()`, pandas' very capable CSV parser. It returns a pandas `DataFrame` object, whose dimensions are available as `.shape`, a `(rows, columns)` tuple.

`.head()` gives us a preview of our data by showing the first 5 rows.

In a Jupyter notebook, the value of the last expression in a cell is displayed automatically, so there's no need to call `print(flights.head())`.
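If you want to try `read_csv`, `.shape`, and `.head()` without waiting for a download, here is a minimal sketch. The three-row CSV text is made up for illustration (the column names are only assumed to resemble the real file), and `io.StringIO` stands in for a file path or URL.

```python
import io

import pandas as pd

# Made-up sample data; the real flights file is far larger.
csv_text = """YEAR,MONTH,DAY,AIRLINE,DEPARTURE_DELAY
2015,1,1,AS,-11
2015,1,1,AA,-8
2015,1,2,US,-2
"""

# read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# .shape is a (rows, columns) tuple.
rows, cols = sample.shape
print("Dimensions: {r} rows, {c} cols".format(r=rows, c=cols))

# .head(n) returns the first n rows (5 if n is omitted).
preview = sample.head(2)
print(preview)
```

In a notebook you could end the cell with a bare `preview` instead of `print(preview)` and get the nicer table rendering.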

### FOR AZURE NOTEBOOKS

In [3]:
flights = pd.read_csv('https://s3.amazonaws.com/vandyhacks/datascience/flights/flights_sample.csv') # Load the csv from our s3 bucket
# 80% sample of original data

print("Dimensions: {r} rows, {c} cols".format(r=flights.shape[0], c=flights.shape[1])) 
flights.head()

### FOR LOCAL/Other

In [None]:
flights = pd.read_csv('https://s3.amazonaws.com/vandyhacks/datascience/flights/flights.csv') # Load the csv from our s3 bucket
# This might take a while since the file is about 500 MB (in the future, just download it to your computer and change the path)

print("Dimensions: {r} rows, {c} cols".format(r=flights.shape[0], c=flights.shape[1])) 
flights.head()

## ALL notebooks run from here down

In [23]:
airlines = pd.read_csv('https://s3.amazonaws.com/vandyhacks/datascience/flights/airlines.csv')
airlines.head() 

| Unnamed: 0 | IATA_CODE | AIRLINE |
|---|---|---|
| 0 | UA | United Air Lines Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | US | US Airways Inc. |
| 3 | F9 | Frontier Airlines Inc. |
| 4 | B6 | JetBlue Airways |


# 3. Remove Columns with Empty Values

In [None]:
flights = flights.dropna(axis=1, thresh=int(0.8 * flights.shape[0])) # Keep only columns with at least 80% non-NA values; drop the rest

print("Dimensions: {r} rows, {c} cols".format(r=flights.shape[0], c=flights.shape[1]))
flights.head()
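To see what `thresh` does, here is a small sketch on a made-up five-row DataFrame (the column names `full`, `mostly`, and `sparse` are hypothetical). With `axis=1`, `thresh` is the minimum number of non-NA values a column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Toy data: 5 rows, with columns of varying completeness.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4, 5],                          # 5 non-NA values
    "mostly": [1.0, 2.0, np.nan, 4.0, 5.0],           # 4 non-NA values (80%)
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 1 non-NA value
})

# Require at least 80% of the rows to be non-NA: int(0.8 * 5) == 4.
min_non_na = int(0.8 * toy.shape[0])
kept = toy.dropna(axis=1, thresh=min_non_na)

print(list(kept.columns))
```

Only `sparse` falls below the threshold, so it is the only column dropped.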

# 4. Add 'DATE' and 'TOTAL_DELAY' Columns

In pandas, you can simply make a new column with assignment. For example:

```
df['a'] = df['b'] + df['c']
```
For every row of a DataFrame `df`, the value in column 'a' is the sum of the values in columns 'b' and 'c' on the same row.

As the example shows, you access a column of a DataFrame with bracket notation, e.g. `df['b']`. Tab completion works pretty well in Jupyter.
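Here is the same idea as a runnable sketch, using a tiny made-up DataFrame so you can check the row-by-row arithmetic yourself.

```python
import pandas as pd

# Toy DataFrame with two numeric columns.
df = pd.DataFrame({"b": [1, 2, 3], "c": [10, 20, 30]})

# Assigning to a new column name creates the column.
# The addition is element-wise, aligned on the row index.
df["a"] = df["b"] + df["c"]

print(df)
```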

pandas holds date-time information in `Timestamp` objects (its counterpart to Python's `datetime`). The `pd.to_datetime` function is pretty flexible in converting Series, strings, and other items into them.
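As a sketch of how `pd.to_datetime` can be used, the example below converts a list of strings and also assembles dates from separate year, month, and day columns. The `YEAR`/`MONTH`/`DAY` column names and values are assumptions for illustration; the columns are renamed to lowercase before assembly, which is the documented form.

```python
import pandas as pd

# 1) From strings: each entry is parsed into a Timestamp.
from_strings = pd.to_datetime(["2015-01-01", "2015-02-14"])

# 2) From separate columns (made-up data with assumed column names).
parts = pd.DataFrame({
    "YEAR": [2015, 2015],
    "MONTH": [1, 2],
    "DAY": [1, 14],
})
# to_datetime assembles a date from columns named year, month, day.
dates = pd.to_datetime(parts.rename(columns=str.lower))

print(from_strings)
print(dates)
```

The result of the second call is a Series, so it can be assigned straight into a new column such as `parts['DATE'] = dates`.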


The next piece of code is pretty nutty:

`flights.groupby('DATE').mean()` returns a new DataFrame. It groups the rows of `flights` by matching values in the 'DATE' column and takes the mean of every other column within each group. The grouped 'DATE' column becomes the new index. On pandas 2.0 and later, pass `numeric_only=True` to `.mean()` if `flights` still contains non-numeric columns, otherwise the call raises a `TypeError`.

`pd.DataFrame(flights.groupby('DATE').mean(), columns=['DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'TOTAL_DELAY'])` creates a DataFrame that keeps only the columns we specify. We can also do this by subsetting with a list:

`new_df = flights.groupby('DATE').mean()[['DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'TOTAL_DELAY']]`
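The sketch below runs both subsetting styles on a made-up three-row DataFrame and checks that they give the same result. The values are invented for illustration, and `numeric_only=True` is added so the text column `AIRLINE` is skipped rather than averaged.

```python
import pandas as pd

# Made-up data: two flights on Jan 1, one on Jan 2.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-02"]),
    "AIRLINE": ["AA", "UA", "AA"],
    "DEPARTURE_DELAY": [10.0, 20.0, -4.0],
    "ARRIVAL_DELAY": [6.0, 14.0, -2.0],
    "TOTAL_DELAY": [16.0, 34.0, -6.0],
})

cols = ["DEPARTURE_DELAY", "ARRIVAL_DELAY", "TOTAL_DELAY"]

# One row per date; DATE becomes the index.
daily_mean = toy.groupby("DATE").mean(numeric_only=True)

# Two ways to keep only the columns we care about.
via_constructor = pd.DataFrame(daily_mean, columns=cols)
via_list = daily_mean[cols]

print(via_list)
print(via_constructor.equals(via_list))
```

The list form is usually the more common idiom; the constructor form does the same selection here.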