# Tutorial 1B Hands on data - wrangling data with Python & pandas



In two parts:
* reproducing the DataWrangler process (using the same 'Air crashes' data) and 
* bad, bad data investigations


**Before you start, a most important thing to do: check your Python version**

In [None]:
import sys
print (sys.version_info)

In [None]:
import pandas as pd # not the bamboo eating bear... 'Panel Data' 
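Several pandas calls used later in this notebook behave differently from one pandas release to the next, so it may help to note the pandas version alongside the Python version. A minimal sketch:

```python
import sys
import pandas as pd

# Record both versions; some calls used later changed between pandas releases
python_version = sys.version_info[:3]
pandas_version = pd.__version__
print(python_version)
print(pandas_version)
```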

# Part 1 
## Step 1 Read data 

In [None]:
# load data
df = pd.read_csv('AirCrashes.csv') # df is a dataframe, confirm with: type(df)
df.shape

#### How many lines of data?
#### How many did you get with DataWrangler?

Have a look at the first few rows:

That's not quite right... the first line has been stolen for the title/header
#### Does read_csv ignore empty lines? 
e.g. line 18 should be blank

skip_blank_lines = True (the default) see:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Try again, force our own headers upon the data using default column names from DataWrangler (split/extract etc.) 

This pushes the first incident down into the data where it belongs (but adds a NaN, below)

In [None]:
df = pd.read_csv('AirCrashes.csv', names = ['split', 'split1']) 
# column names (split, split1) replicate DataWrangler column names
df.head(10)
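The effect of `names` (and of `skip_blank_lines`) can be checked on a small in-memory file. This is only a sketch: the rows are made up rather than taken from AirCrashes.csv, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Made-up two-column data with one blank line, standing in for the real file
raw = "Incident Flight X involving a Plane Y in 2001,\nCrew,11\n\nPassengers,81\n"

# Default: the first line is consumed as the header, the blank line is skipped
default_df = pd.read_csv(io.StringIO(raw))

# Our own column names: the first line stays in the data (with a NaN beside it)
named_df = pd.read_csv(io.StringIO(raw), names=['split', 'split1'])

# Keeping blank lines turns them into rows that are entirely NaN
with_blanks = pd.read_csv(io.StringIO(raw), names=['split', 'split1'],
                          skip_blank_lines=False)

print(default_df.shape, named_df.shape, with_blanks.shape)
```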

## Step 2 Extract index(s)

### Extract flight information
The flight information is between "Incident" and "involving" in the "split" cell.
Now, extract flights based on one of the suggestions from DataWrangler, i.e., "(Incident (.*) involving)".

In [None]:
# treat the 'split' column as a str, then 
# use the extract method on the str
df['split'].str.extract("Incident (.*) involving")

### That seems to have worked...

We got 'American Airlines Flight 11' and 'United Airlines Flight 175' etc but lost all the other data and gained a bunch of NaNs

#### What is this 'str.extract' code anyway? (add a comment to decode or explain it)

extract("Incident (.*) involving")
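As a hint for decoding it: the parentheses form a capture group and `.*` matches any run of characters, so `str.extract` returns whatever sits between the two literal words, and NaN for rows where the pattern does not match. A sketch on made-up strings (`expand=False` is used here so the result is a Series rather than a one-column DataFrame):

```python
import pandas as pd

# Made-up values: one incident line, one ordinary field name, one missing value
s = pd.Series(["Incident Flight X involving a Plane Y in 2001", "Crew", None])

# The capture group (.*) is the part that gets returned
flights = s.str.extract(r"Incident (.*) involving", expand=False)
print(flights)
```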

### But we want planes in a new column, we can create one called 'extract' like this:

df['extract'] = df['split'].str.extract("Incident (.*) involving")

But this would be the last column, we want it in the second (location is not critical but it can be done so why not). 

Now, use the DataFrame's insert function.

In [None]:
df.head(20)
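One way to use `DataFrame.insert` for this, sketched on made-up rows: `insert(loc, column, value)` modifies the frame in place, and `loc=1` puts the new column second.

```python
import pandas as pd

# Made-up stand-in for the real frame
toy = pd.DataFrame({
    "split": ["Incident Flight X involving a Plane Y in 2001", "Crew"],
    "split1": [None, "11"],
})

flights = toy["split"].str.extract(r"Incident (.*) involving", expand=False)
toy.insert(1, "extract", flights)  # in place; the call itself returns None
print(list(toy.columns))
```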

So there's the flight information in its own column, plus a whole lot of NaNs

We could replace all the NaN with spaces or similar but they can wait

Now we want the aircraft in its own column, similar to above, 
based on the suggestion from DataWrangler, note spaces in "\ a (.*)\ in "

#### but is this optimal?  

In [None]:
# and repeat to get the aircraft type that appears between "a" and "in"

# and df.head(20) to confirm
df.head(20)
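On the "is this optimal?" question, one thing to test is what the pattern does when " a " appears earlier in the line. The strings below are made up to show the effect, and the stricter pattern assumes every incident line contains the word "involving" just before the aircraft:

```python
import pandas as pd

# Made-up incident lines; the second has an extra " a " before "involving"
s = pd.Series([
    "Incident Flight X involving a Plane Y in 2001",
    "Incident Flight to a Town involving a Plane Y in 2001",
])

loose = s.str.extract(r" a (.*) in ", expand=False)             # starts at the first " a "
anchored = s.str.extract(r"involving a (.*) in ", expand=False)  # tied to "involving"
print(loose.tolist())
print(anchored.tolist())
```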

Drop the empty lines 

In [None]:
df = df.dropna(how='all') 
df
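The `how='all'` argument matters here: it drops only rows in which every column is NaN, whereas the default (`how='any'`) would also throw away rows that are merely incomplete. A sketch on made-up rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "split": ["Incident Flight X", "Crew", np.nan],
    "split1": [np.nan, "11", np.nan],
})

only_empty_dropped = toy.dropna(how='all')  # keeps the partly filled first row
any_nan_dropped = toy.dropna()              # default how='any' drops it too
print(len(only_empty_dropped), len(any_nan_dropped))
```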

## Step 3  'Fill down'

We want to 'fill down' the indexes  (e.g. lines 1 to 15 should be associated with line 0)

There are several options
* na.locf() method from the zoo package (R) 
* ddply() from plyr (R)
* bfill()
* fillna()


In [None]:
# magic, take the previous value (not NaN) and fill down
df =
# http://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-missing-values-fillna
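One possible approach is a forward fill, sketched here on a made-up column. `DataFrame.ffill()` is used below; older code often writes `fillna(method='ffill')` instead, which recent pandas versions deprecate.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "extract": ["Flight X", np.nan, np.nan, "Flight Z", np.nan],
    "split": ["Incident Flight X", "Crew", "Passengers", "Incident Flight Z", "Crew"],
})

# Each NaN takes the most recent non-NaN value above it
filled = toy.ffill()
print(filled)
```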

In [None]:
df.head(200)

## Step 4 Remove the index row

We need to delete all the 'incident' rows, they have served their purpose and are now redundant. 

In [None]:
# use the str.contains function to find the rows.
# keep everything that doesn't have "Incident" in it 
df = 
# do we need to worry about a plane called "Incident" or "Incident weather" etc???
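A sketch of the filtering idea on made-up rows, including the worry raised in the comment above. The field name "Incident weather" is hypothetical, and the stricter pattern assumes the index rows always look like "Incident ... involving ...":

```python
import pandas as pd

toy = pd.DataFrame({
    "split": ["Incident Flight X involving a Plane Y in 2001", "Crew", "Incident weather"],
    "split1": [None, "11", "Fog"],
})

# ~ negates the boolean mask: keep rows that do NOT match
loose = toy[~toy["split"].str.contains("Incident")]

# A stricter regular expression leaves the hypothetical 'Incident weather' row alone
strict = toy[~toy["split"].str.contains(r"^Incident .* involving ")]
print(len(loose), len(strict))
```

If the column still contains NaN values, `str.contains` returns NaN for them; passing `na=False` treats those rows as non-matching.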

In [None]:
# and check with e.g. df.shape 
print(df.shape)
df

## Now to 'unfold', there are several options

* melt()
* stack, unstack?
* pivot_table()
* pivot()

http://pandas.pydata.org/pandas-docs/stable/reshaping.html

In [None]:
data = df.pivot(index = 'extract', columns = 'split', values = 'split1') 
# the parameters are index, columns, values (newer pandas versions require the keywords)
# Older pandas versions also accept them positionally, left to right:
#data = df.pivot('extract', 'split', 'split1') 
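What `pivot` does is easier to see on a tiny long-format table. The values below are made up; each distinct value in `split` becomes a column and each distinct value in `extract` becomes a row.

```python
import pandas as pd

toy = pd.DataFrame({
    "extract": ["Flight X", "Flight X", "Flight Z", "Flight Z"],
    "split": ["Crew", "Passengers", "Crew", "Passengers"],
    "split1": ["11", "81", "9", "56"],
})

wide = toy.pivot(index="extract", columns="split", values="split1")
print(wide)
```

Note that `pivot` raises a ValueError if the same index/column pair occurs more than once, which is itself a useful check on the data.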

In [None]:
data.shape 

In [None]:
data.head() # 58 records, good, but lost plane type, bad
# where's 'extract1' - can we have multiple indexes or have to put that data back in?

### Problem:

we have two columns we want to pivot on, 'extract' & 'extract1' (AKA flight & plane) but

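pivot() can't have multiple indexes (at least in older pandas versions; since pandas 1.1 `index` also accepts a list of column names)...

e.g. df.pivot(index = ['extract','extract1'], columns = 'split', values = 'split1') # error in older pandas

pivot_table() can but insists on doing some accounting or aggregating too, like sum or avg, which we don't need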

e.g. pd.pivot_table(df, values='split1', index = ['extract','extract1'], columns='split') # error, no function 

### Solutions? 

use pivot() then force the other column back into the data?? 

or trick pivot_table() into doing some pointless accounting (that adds up to nothing)??

try something else... stack, unstack, group, dplyr?

DIY code??



![](http://www.desktopimages.org/pictures/2014/0212/1/orig_150933.jpg)

In [None]:
# solution: make the function a copy, x = x
# data = pd.pivot_table(df, index=["extract","extract1"], columns = 'split', values = 'split1', aggfunc = lambda x: x)   
# or 
data = pd.pivot_table(df, index=["extract","extract1"], columns = 'split', values = 'split1', aggfunc = 'max') 
# ha, cop that
# http://stackoverflow.com/questions/19279229/pandas-pivot-table-with-non-numeric-values-dataerror-no-numeric-types-to-ag

data # not using df anymore, keep it as backup
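The same trick on made-up rows, using `aggfunc='first'` as another "do nothing" aggregation: with one value per cell, 'first' (like 'max') simply hands that value back. `reset_index()` then turns the two index levels back into ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "extract": ["Flight X", "Flight X", "Flight Z", "Flight Z"],
    "extract1": ["Plane Y", "Plane Y", "Plane W", "Plane W"],
    "split": ["Crew", "Passengers", "Crew", "Passengers"],
    "split1": ["11", "81", "9", "56"],
})

wide = pd.pivot_table(toy, index=["extract", "extract1"], columns="split",
                      values="split1", aggfunc="first")
flat = wide.reset_index()  # flight and plane become normal columns again
print(flat)
```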

### That's most of the wrangling as was done with DataWrangler, there are a few more optional steps:
* want the manufacturer e.g. Boeing?
* remove 'extract' & 'extract1'?
* rename columns ('split') 
* extract year into new column 
* export e.g. df.to_csv(file_name, sep=',')

In [None]:
#data.reset_index()

# Part 2 

## Wait there's more:

Bad, bad data

This data has been deliberately damaged (sorry)

Some are obvious, some are subtle (some were already there... e.g. look for 'Â')

### See if you can find them

In [None]:
# start with a summary table
data.describe()

### describe() shows:

* count - we can see that there are 58 records across the board, no surprise (what would it mean if there were non 58s?)
* unique - looks like all the 'Casualties' are identical (unique = 1, i.e. all 'Extremely High'), maybe this column is redundant?
* top - interesting, there were two major disasters in the exact same place? Check dates?
* freq - also interesting, 3 times there were 156 passengers on flights... superstitious? Or is it bad data?

So some clues here, dig deeper:

In [None]:
#We can also describe individual columns:
data['Crew'].describe()

In [None]:
# nothing new here, the most common crew size is 14 (7 times)
# what's the biggest crew?
data['Crew'].max()


Why 9 (or even '9')?

How can this max be less than 14?

Are these even numbers?


In [None]:
data['Crew'].mean() # expecting ~15, they were all big planes?

This is weird

If they are numbers it should be higher

If they are not numbers, what does 'mean' mean? 

In [None]:
data[data['Crew'] == 14] # look at all the crew = 14 planes, should be 7

In [None]:
# there are none... try this:
data[data['Crew'] == '14'] 

In [None]:
# so there's a clue, 14 vs '14' 
# what are these data types anyway?
data.dtypes
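A small made-up example of why the type matters: strings compare character by character, so '9' sorts after '14', and a string never equals the integer 14.

```python
import pandas as pd

# Made-up crew sizes stored as text, as in an object column
crew_text = pd.Series(["14", "9", "11"], dtype=object)
crew_num = pd.to_numeric(crew_text)

text_max = crew_text.max()   # compared as text
num_max = crew_num.max()     # compared as numbers
matches_int = (crew_text == 14).sum()
matches_str = (crew_text == "14").sum()
print(text_max, num_max, matches_int, matches_str)
```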

### All data are of type object... 
### Overruled:

In [None]:
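data = data.convert_objects(convert_numeric = True) 
# older pandas only: convert_objects() was deprecated and then removed in pandas 1.0,
# on newer versions apply pd.to_numeric to each column instead
# this is a bit brutal, can you convert when data is loaded?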
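On pandas versions without `convert_objects`, something similar can be sketched with `pd.to_numeric`. The helper name `to_number_if_possible` is made up for this example, and it only approximates the old behaviour: values that cannot be converted become NaN, and a column with no numeric values at all is left untouched.

```python
import pandas as pd

def to_number_if_possible(col):
    """Convert a column to numbers unless nothing in it is numeric."""
    converted = pd.to_numeric(col, errors="coerce")  # bad values become NaN
    return col if converted.isnull().all() else converted

# Made-up data: one mostly numeric column with a bad value, one text column
toy = pd.DataFrame({"Crew": ["14", "9", "x"], "Phase": ["ENR", "APR", "ENR"]},
                   dtype=object)
typed = toy.apply(to_number_if_possible)
print(typed.dtypes)
```

The silent NaN for 'x' is the "brutal" part: the bad value disappears from view unless you count the missing values afterwards.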

In [None]:
data.dtypes

In [None]:
# so now we have some numbers (int & float)
data.describe()

### Better, more information, describe() now shows:
* count - as above, all 58
* mean - crew ~15, seems OK, but mean lat & long doesn't mean much... or does it?
* std - Standard Deviation
* min - now we see some problems, zero crew? Was this a way to code a hijacking, or is it missing, or should it be 10, 20, 30?
* 25, 50 & 75% are quartiles...
* max - crew 181, no way! 1692 dead, no plane is that big... or could this be Lockerbie, i.e. plane hit town

Let's investigate the crew data:

In [None]:
data[data['Crew'] > 20] # try also e.g. < 10

So Boeing 747s have large crews...
#### 33 crew is that possible? Or is that two flights?

#### 181 crew? Same value as for 'Total dead', can you derive crew from dead minus passengers?
(or does 'Some survivors' corrupt the maths?)


# Another way to explore... plot that data:


In [None]:
%matplotlib inline 
# notebook majik to display plots in the notebook

data[['Crew','Passengers']].plot(x= 'Crew',y= 'Passengers',kind= 'scatter')

#### And there's the extreme outlier
#### Are there any others?

In [None]:
# so how to put a number to an outlier?
data['Crew'].max() # works, now that they are numbers

In [None]:
data[data['Crew']==data['Crew'].max()]
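One common convention for putting a number on "outlier" is Tukey's rule: anything more than 1.5 times the interquartile range above the third quartile. This is a swapped-in rule of thumb rather than something this tutorial prescribes, and the crew sizes below are made up:

```python
import pandas as pd

crew = pd.Series([5, 7, 9, 11, 14, 14, 16, 181])  # made-up crew sizes

q1, q3 = crew.quantile(0.25), crew.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)   # upper fence of Tukey's rule
outliers = crew[crew > upper]
print(upper, outliers.tolist())
```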

## Plotting non-numeric data

In [None]:
data['Phase'].value_counts().plot(kind='bar')
# you can guess some of these codes
# ENR = en route?
# APR = Approach
# Takeoff, Landing
# ICL?

In [None]:
# not much point in plotting Casualties... so what the hey
data['Casualties'].value_counts().plot(kind='bar')

# To do: find any other data problems (there are about 10)

<br>

![](http://media1.popsugar-assets.com/files/thumbor/yjoSwHRBZ4MpTO3TN6lvI_gsKMI/fit-in/2048xorig/filters:format_auto-!!-:strip_icc-!!-/2016/03/02/901/n/1922283/01f64bd801c06153_game4/i/When-Everyone-Trying-Talk-You-Youre-Too-Hungry-Care.gif)

#### Post your suspected bad data cases in Moodle discussion forums, how you found it, and suggested fixes
(one or two each, share the load)


In [None]:
# go crazy

#### Can data be 'typed' as it is read in? 
(yes see 'dtype')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
#### What happens if this process encounters bad data?
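A sketch of both questions on made-up data that is already in a tidy, one-column-per-field layout (the AirCrashes file is not in that shape when first read, so this is an illustration only). With an integer `dtype`, a clean file loads as numbers and a damaged value stops the load with an error:

```python
import io
import pandas as pd

clean = "Crew,Passengers\n14,156\n9,81\n"
dirty = "Crew,Passengers\n14,156\nnine,81\n"   # 'nine' is the bad value

typed = pd.read_csv(io.StringIO(clean), dtype={"Crew": "int64"})

try:
    pd.read_csv(io.StringIO(dirty), dtype={"Crew": "int64"})
    failed = False
except ValueError as err:
    failed = True   # the bad value cannot be parsed as an integer
    print(err)
```

Failing loudly at load time is the opposite trade-off to coercing bad values to NaN later: nothing slips through, but nothing loads until the data is fixed.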

#### So why is 'Crew' max '9' above?

#### When a plane hits another plane is that one record or two?

#### Can DataWrangler do this sort of wrangling? 

