# Springboard Data Science Career Track Unit 4 Challenge - Tier 3 Complete

## Objectives
Hey! Great job getting through those challenging DataCamp courses. You're learning a lot in a short span of time. 

In this notebook, you're going to apply the skills you've been learning, bridging the gap between the controlled environment of DataCamp and the *slightly* messier work that data scientists do with actual datasets!

Here’s the mystery we’re going to solve: ***which boroughs of London have seen the greatest increase in housing prices, on average, over the last two decades?***


A borough is just a fancy word for district. You may be familiar with the five boroughs of New York… well, there are 32 boroughs within Greater London [(here's some info for the curious)](https://en.wikipedia.org/wiki/London_boroughs). Some of them are more desirable areas to live in, and the data will reflect that with a greater rise in housing prices.

***This is the Tier 3 notebook, which means it's not filled in at all: we'll just give you the skeleton of a project, the brief and the data. It's up to you to play around with it and see what you can find out! Good luck! If you struggle, feel free to look at easier tiers for help; but try to dip in and out of them, as the more independent work you do, the better it is for your learning!***

This challenge will make use of only what you learned in the following DataCamp courses: 
- Prework courses (Introduction to Python for Data Science, Intermediate Python for Data Science)
- Data Types for Data Science
- Python Data Science Toolbox (Part One) 
- pandas Foundations
- Manipulating DataFrames with pandas
- Merging DataFrames with pandas

Of the tools, techniques and concepts in the above DataCamp courses, this challenge should require the application of the following: 
- **pandas**
    - **data ingestion and inspection** (pandas Foundations, Module One) 
    - **exploratory data analysis** (pandas Foundations, Module Two)
    - **tidying and cleaning** (Manipulating DataFrames with pandas, Module Three) 
    - **transforming DataFrames** (Manipulating DataFrames with pandas, Module One)
    - **subsetting DataFrames with lists** (Manipulating DataFrames with pandas, Module One) 
    - **filtering DataFrames** (Manipulating DataFrames with pandas, Module One) 
    - **grouping data** (Manipulating DataFrames with pandas, Module Four) 
    - **melting data** (Manipulating DataFrames with pandas, Module Three) 
    - **advanced indexing** (Manipulating DataFrames with pandas, Module Four) 
- **matplotlib** (Intermediate Python for Data Science, Module One)
- **fundamental data types** (Data Types for Data Science, Module One) 
- **dictionaries** (Intermediate Python for Data Science, Module Two)
- **handling dates and times** (Data Types for Data Science, Module Four)
- **function definition** (Python Data Science Toolbox - Part One, Module One)
- **default arguments, variable length, and scope** (Python Data Science Toolbox - Part One, Module Two) 
- **lambda functions and error handling** (Python Data Science Toolbox - Part One, Module Four) 

## The Data Science Pipeline

This is Tier Three, so we'll get you started. But after that, it's all in your hands! When you feel done with your investigations, look back over what you've accomplished, and prepare a quick presentation of your findings for the next mentor meeting. 

Data Science is magical. In this case study, you'll get to apply some complex machine learning algorithms. But as  [David Spiegelhalter](https://www.youtube.com/watch?v=oUs1uvsz0Ok) reminds us, there is no substitute for simply **taking a really, really good look at the data.** Sometimes, this is all we need to answer our question.

Data Science projects generally adhere to the four stages of the Data Science Pipeline:
1. Sourcing and loading 
2. Cleaning, transforming, and visualizing 
3. Modeling 
4. Evaluating and concluding 


### 1. Sourcing and Loading 

Any Data Science project kicks off by importing ***pandas***. The documentation of this wonderful library can be found [here](https://pandas.pydata.org/). As you've seen, pandas is conveniently connected to the [NumPy](http://www.numpy.org/) and [Matplotlib](https://matplotlib.org/) libraries. 

***Hint:*** This part of the data science pipeline will test those skills you acquired in the pandas Foundations course, Module One. 

#### 1.1. Importing Libraries

In [None]:
# Let's import the pandas and numpy libraries as pd and np, respectively.
import pandas as pd
import numpy as np

# Load the pyplot collection of functions from matplotlib, as plt 
import matplotlib.pyplot as plt


#### 1.2.  Loading the data
Your data comes from the [London Datastore](https://data.london.gov.uk/): a free, open-source data-sharing portal for London-oriented datasets. 

In [None]:
# First, make a variable called url_LondonHousePrices, and assign it the following link, enclosed in quotation-marks as a string:
# https://data.london.gov.uk/download/uk-house-price-index/70ac0766-8902-4eb5-aab5-01951aaed773/UK%20House%20price%20index.xls

url_LondonHousePrices = "https://data.london.gov.uk/download/uk-house-price-index/70ac0766-8902-4eb5-aab5-01951aaed773/UK%20House%20price%20index.xls"

# The dataset we're interested in contains the Average prices of the houses, and is actually on a particular sheet of the Excel file. 
# As a result, we need to specify the sheet name in the read_excel() method.
# Put this data into a variable called properties.  
properties = pd.read_excel(url_LondonHousePrices, sheet_name='Average price', index_col= None)

### 2. Cleaning, transforming, and visualizing
This second stage is arguably the most important part of any Data Science project. The first thing to do is take a proper look at the data. Cleaning forms the majority of this stage, and can be done either before or after transformation.

The end goal of data cleaning is to have tidy data. When data is tidy: 

1. Each variable has a column.
2. Each observation forms a row.

Keep the end goal in mind as you move through this process; every step will take you closer. 



***Hint:*** This part of the data science pipeline should test those skills you acquired in: 
- Intermediate Python for data science, all modules.
- pandas Foundations, all modules. 
- Manipulating DataFrames with pandas, all modules.
- Data Types for Data Science, Module Four.
- Python Data Science Toolbox - Part One, all modules

**2.1. Exploring your data** 

Think about your pandas functions for checking out a dataframe. 

In [None]:
# Understanding what the dataframe looks like -- number of rows and columns
properties.shape

In [None]:
# .head() shows us the first rows of the data, default value is 5
properties.head()

In [None]:
# .tail() shows us the last rows of the data, default value is 5
properties.tail()

**2.2. Cleaning the data**

You might find you need to transpose your dataframe, check out what its row indexes are, and reset the index. You also might find you need to assign the values of the first row to your column headings. (Hint: recall the `.columns` attribute of DataFrames, as well as the `iloc[]` indexer.)

Don't be afraid to use StackOverflow for help with this.
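
The steps described above can be tried on a toy table first. The sketch below is only an illustration: the frame `raw` and all of its values are made up to mimic a sheet whose first data row holds area codes rather than prices, and the real sheet may differ in its details.

```python
import pandas as pd

# Toy stand-in for the raw sheet (all values are invented for illustration):
# dates run down the rows, areas run across the columns, and the first
# data row holds area codes instead of prices.
raw = pd.DataFrame({
    "Unnamed: 0": [pd.NaT, pd.Timestamp("1995-01-01"), pd.Timestamp("1995-02-01")],
    "Camden": ["E09000007", 120000.0, 121000.0],
    "Brent": ["E09000005", 71000.0, 72000.0],
})

# Transpose so each area becomes a row, then move the area names
# out of the index and into an ordinary column.
wide = raw.T.reset_index()

# The row at position 0 now holds the real headings: promote it to
# the column labels, then drop it from the data.
wide.columns = wide.iloc[0]
wide = wide.drop(0).reset_index(drop=True)

print(wide)
```

After this, the first column is still labelled `Unnamed: 0` and the code column is labelled `NaT`, which is why a renaming step follows.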

In [None]:
# Transpose so that districts become the index and dates become the columns: df.T or df.transpose()
properties_transposed = properties.T

In [None]:
# Check what its rows indexes are
properties_transposed.head()

In [None]:
# Check what our row indexes are - info we want to analyze should NEVER be in the index!
properties_transposed.index

In [None]:
# Verify the index
# Index: 49 entries, Unnamed: 0 to England ---> Need to reset index
# Note: the null_counts argument was replaced by show_counts in pandas 1.2 and later removed

properties_transposed.info(verbose=True, show_counts=True)

In [None]:
# Use reset_index()

properties_transposed = properties_transposed.reset_index()

In [None]:
# Check the dataframe
properties_transposed.head()

In [None]:
# To confirm that our DataFrame's columns are mainly just integers, call the .columns feature on our DataFrame:
properties_transposed.columns

In [None]:
# Use iloc[] with double square brackets on properties_transposed to see the row at position 0 as a DataFrame.
properties_transposed.iloc[[0]]

In [None]:
# Use the values in row 0 (the dates) as the column headings
properties_transposed.columns = properties_transposed.iloc[0]

In [None]:
# see dataframe properties
properties_transposed.head()

In [None]:
# Need to drop the row at index 0  
properties_transposed = properties_transposed.drop(0)

In [None]:
# We check the df one more time to see how it looks
properties_transposed.head()

**2.3. Cleaning the data (part 2)**

You might have to **rename** a couple of columns. How do you do this? The clue's pretty bold...

In [None]:
# Rename the 'Unnamed: 0' and NaT columns
properties_transposed = properties_transposed.rename(columns = {'Unnamed: 0':'London_Districts', pd.NaT: 'ID'})

In [None]:
# Checking how the data looks
properties_transposed.head()

**2.4. Transforming the data**

Remember what Hadley Wickham said about tidy data? 

You might need to **melt** your DataFrame here. 
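
As a rough illustration of what melting does, here is a minimal sketch on a made-up two-district table. The frame `wide` and its prices are hypothetical; `var_name` and `value_name` are standard `pd.melt` parameters that name the two new columns directly, which is one possible alternative to renaming them afterwards.

```python
import pandas as pd

# Hypothetical wide table: one row per district, one column per month.
wide = pd.DataFrame({
    "London_Districts": ["Camden", "Brent"],
    "ID": ["E09000007", "E09000005"],
    pd.Timestamp("1995-01-01"): [120000.0, 71000.0],
    pd.Timestamp("1995-02-01"): [121000.0, 72000.0],
})

# Keep the identifying columns fixed and stack every month column
# into (Date, Average_price) pairs: one row per district per month.
tidy = pd.melt(
    wide,
    id_vars=["London_Districts", "ID"],
    var_name="Date",
    value_name="Average_price",
)

print(tidy)
```

Two districts and two months give four rows, each one a single observation.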

In [None]:
# Most important properties of tidy data are: Each column is a variable. Each row is an observation
# Result: a DataFrame with rows representing the average house price within a given month and a given district


clean_properties = pd.melt(properties_transposed, id_vars= ['London_Districts', 'ID'])

In [None]:
# check the data
clean_properties.head(400)

In [None]:
# Re-name the column names
clean_properties = clean_properties.rename(columns = {0: 'Date', 'value': 'Average_price'})

In [None]:
# Check the data
clean_properties.head()

Remember to make sure your column data types are all correct. Average prices, for example, should be floating point numbers... 
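
A small sketch of the type conversion follows, using an invented Series. The `"-"` placeholder is an assumption made for illustration and may not occur in the real data; `errors="coerce"` is an optional setting that turns unparseable entries into NaN rather than raising an error.

```python
import pandas as pd

# Invented object-dtype column mixing strings, numbers and placeholders.
prices = pd.Series(["120000.5", 71000, "-", None], dtype="object")

# Convert to numbers; anything that cannot be parsed becomes NaN.
as_float = pd.to_numeric(prices, errors="coerce")

print(as_float.dtype)         # dtype after conversion
print(as_float.isna().sum())  # how many entries could not be parsed
```

Coercing hides bad values rather than fixing them, so it is worth counting the resulting NaNs before moving on.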

In [None]:
# check if the data types are all correct
clean_properties.dtypes

In [None]:
# Average_price is NOT float type, so we need to change that by using to_numeric() 

clean_properties['Average_price'] = pd.to_numeric(clean_properties['Average_price'])

In [None]:
# Check the data to make sure Average_price is now a float

clean_properties.dtypes

In [None]:
# To see if there are any missing values, we should call the count() method on our DataFrame:

clean_properties.count()

**2.5. Cleaning the data (part 3)**

Do we have an equal number of observations in the `ID`, `Average_price`, `Date`, and `London_Districts` columns? Remember that there are only 32 London Boroughs. How many unique entries do you have in that column? 

Check out the contents of the `London_Districts` column, and if you find null values, get rid of them however you see fit. 
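
Two common ways of removing nulls behave slightly differently, which the toy frame below tries to show. All rows are invented, and "Hypothetical area" is a made-up entry included only to expose the difference between the two approaches.

```python
import pandas as pd
import numpy as np

# Invented rows: two complete, one entirely empty, one missing only its ID.
toy = pd.DataFrame({
    "London_Districts": ["Camden", "Unnamed: 34", "Brent", "Hypothetical area"],
    "ID": ["E09000007", np.nan, "E09000005", np.nan],
    "Average_price": [120000.0, np.nan, 71000.0, 100000.0],
})

# Count the nulls in each column.
print(toy.isna().sum())

# Option 1: keep rows where one specific column is present.
by_filter = toy[toy["Average_price"].notna()]

# Option 2: keep only rows with no nulls in any column.
by_dropna = toy.dropna()

print(by_filter.shape, by_dropna.shape)
```

Filtering on one column keeps the row that lacks only an ID, while `dropna()` removes it, so comparing shapes is a quick way to see which rows each approach discards.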

In [None]:
# There are mismatches between the number of entries in each column, we know there are 32 districts in London
# We can check for unique values on London_Districts column

clean_properties['London_Districts'].unique()

In [None]:
# There are many entries that are NOT London districts: 'Unnamed: 34','Unnamed: 37','NORTH EAST','NORTH WEST'
#'YORKS & THE HUMBER','EAST MIDLANDS','WEST MIDLANDS','EAST OF ENGLAND','LONDON','SOUTH EAST','SOUTH WEST'
#'Unnamed: 47','England'----> Next: We check what values these entries hold, if nothing valuable, it can be dropped

# Subset clean_properties on the condition: clean_properties['London_Districts'] == 'Unnamed: 34' to see the info it contains
clean_properties[clean_properties['London_Districts'] == 'Unnamed: 34'].head()


In [None]:
# Do the same for the other NOT London districts
clean_properties[clean_properties['London_Districts'] == 'Unnamed: 37'].head()

In [None]:
# Goal is to delete both entries that don't carry any info
# check how many rows have NaN as value for ID column

clean_properties[clean_properties['ID'].isna()]


In [None]:
# Dealing with Null values --> 1st: filtering
# notna() will return a series of booleans, returns false if there is a null

NaN_Free_DF1 = clean_properties[clean_properties['Average_price'].notna()]
NaN_Free_DF1.head(10)


In [None]:
# see how many rows we have that have complete information: 
NaN_Free_DF1.count()

In [None]:
# Looks Good!!! The count shows there are 13680 entries for each column
# Now, use dropna() to drop all null values

# filtering the data with NaN values
NaN_Free_DF2 = clean_properties.dropna()
NaN_Free_DF2.head(10)

In [None]:
# Let's do a count on this DataFrame object: 
NaN_Free_DF2.count()

In [None]:
NaN_Free_DF2['London_Districts'].unique()

In [None]:
# Using the .shape attribute, compare the dimensions of clean_properties, NaN_Free_DF1, and NaN_Free_DF2: 
print(clean_properties.shape)
print(NaN_Free_DF1.shape)
print(NaN_Free_DF2.shape)

In [None]:
# List the remaining entries that are not London districts, so we can drop them

non_Districts = ['Inner London', 'Outer London', 
               'NORTH EAST', 'NORTH WEST', 'YORKS & THE HUMBER', 
               'EAST MIDLANDS', 'WEST MIDLANDS',
              'EAST OF ENGLAND', 'LONDON', 'SOUTH EAST', 
              'SOUTH WEST', 'England']

In [None]:
# Filter NaN_Free_DF2 first on the condition that the row's value for London_Districts is in the non_Districts list
 
NaN_Free_DF2[NaN_Free_DF2.London_Districts.isin(non_Districts)]

In [None]:
# Get just those rows whose values for London_Districts is NOT in the non_Districts
# Just put the negation operator ~ before the filter statement


NaN_Free_DF2[~NaN_Free_DF2.London_Districts.isin(non_Districts)]

In [None]:
# Execute the reassignment 

NaN_Free_DF2 = NaN_Free_DF2[~NaN_Free_DF2.London_Districts.isin(non_Districts)]
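
The negated `isin()` filter used in the cells above can be seen in isolation on a toy frame. The names `toy`, `not_boroughs`, and `boroughs_only`, and all prices, are hypothetical.

```python
import pandas as pd

# Invented rows mixing real boroughs with regional aggregates.
toy = pd.DataFrame({
    "London_Districts": ["Camden", "LONDON", "Brent", "England"],
    "Average_price": [120000.0, 90000.0, 71000.0, 60000.0],
})

not_boroughs = ["LONDON", "England"]

# isin() returns a boolean Series: True where the value is in the list.
mask = toy["London_Districts"].isin(not_boroughs)

# ~ flips the mask, so only rows NOT in the list are kept.
boroughs_only = toy[~mask]

print(boroughs_only)
```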

In [None]:
# Check the data
NaN_Free_DF2.head(10)

In [None]:
NaN_Free_DF2.count()

In [None]:
# We finally have our df
# Take a copy so that adding columns later does not trigger a SettingWithCopyWarning

df = NaN_Free_DF2.copy()

In [None]:
df.head()

In [None]:
df.dtypes

**2.6. Visualizing the data**

To visualize the data, why not subset on a particular London Borough? Maybe do a line plot of Month against Average Price?
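
One possible shape for such a plot is sketched below on invented monthly prices. The frame `one_district` is hypothetical, and the `Agg` backend line is there only so the snippet runs without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Invented monthly prices for a single hypothetical district.
one_district = pd.DataFrame({
    "Date": pd.date_range("1995-01-01", periods=4, freq="MS"),
    "Average_price": [71000.0, 72000.0, 72500.0, 73000.0],
})

# DataFrame.plot returns a matplotlib Axes, which can then be labelled.
ax = one_district.plot(kind="line", x="Date", y="Average_price", legend=False)
ax.set_xlabel("Month")
ax.set_ylabel("Average price (GBP)")
ax.set_title("Hypothetical district")
```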

In [None]:
# I will subset Brent District prices -- assign it the result of filtering the df 
Brent_Prices = df[df['London_Districts'] == 'Brent']



In [None]:
# Line plot for visualization with parameters: kind='line', x='Date', y='Average_price'
Brent_Plot = Brent_Prices.plot(kind ='line', x = 'Date', y = 'Average_price')

# The set_ylabel() method sets the y-axis label to the string 'Price'
Brent_Plot.set_ylabel('Price')


To limit the number of data points you have, you might want to extract the year from every value in your `Date` column. 

To this end, you *could* apply a ***lambda function***. Your logic could work as follows:
1. look through the `Date` column
2. extract the year from each individual value in that column 
3. store that corresponding year in a separate column. 

Whether you go ahead with this is up to you. Just so long as you answer our initial brief: which boroughs of London have seen the greatest house price increase, on average, over the past two decades? 
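
The lambda steps listed above can be sketched on a toy frame, alongside the vectorized `.dt.year` accessor as one alternative. The data is invented. Note the assumption that the column has a datetime64 dtype: `.dt` will not work on an object column, which would first need converting with `pd.to_datetime`.

```python
import pandas as pd

# Invented monthly observations spanning two years.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1998-01-01", "1998-02-01", "2018-01-01"]),
    "Average_price": [100000.0, 102000.0, 500000.0],
})

# Element-wise lambda: pull the year out of each timestamp in turn.
toy["Year_lambda"] = toy["Date"].apply(lambda d: d.year)

# Vectorized equivalent; requires a datetime64 column.
toy["Year"] = toy["Date"].dt.year

# With a Year column in place, monthly prices collapse to yearly means.
yearly = toy.groupby("Year")["Average_price"].mean()
print(yearly)
```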

In [None]:
# Extract the year from each Date value and store it in a new Year column
df['Year'] = df['Date'].apply(lambda month: month.year)

# Call the tail() method on df
df.tail()

In [None]:
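# We want to calculate the yearly mean house price for each London district
# Select Average_price before aggregating, so that non-numeric columns such as ID are not averaged
# (recent pandas versions raise an error instead of silently dropping them)
# The groupby() method returns London_Districts and Year as indices

df_yearly_price = df.groupby(['London_Districts','Year'])[['Average_price']].mean()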

# Check the data
df_yearly_price.head(10)

In [None]:
# Reset the index for our new DataFrame df_yearly_price, and call the head() method on it

df_yearly_price = df_yearly_price.reset_index()
df_yearly_price.head()

### 3. Modeling

Consider creating a function that will calculate a ratio of house prices, comparing the price of a house in 2018 to the price in 1998.

Consider calling this function create_price_ratio.

You'd want this function to:
1. Take a filter of df_yearly_price, specifically where this filter constrains London_Districts, as an argument. For example, one admissible argument should be: df_yearly_price[df_yearly_price['London_Districts']=='Camden'].
2. Get the Average Price for that district, for the years 1998 and 2018.
3. Calculate the ratio of the Average Price for 2018 divided by the Average Price for 1998.
4. Return that ratio.

Once you've written this function, you ultimately want to use it to iterate through all the unique London districts and work out the ratio capturing the change in house prices between 1998 and 2018.

Bear in mind: you don't have to write a function like this if you don't want to. If you can solve the brief otherwise, then great! 
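
As one example of solving the brief without a per-district function, the sketch below pivots a made-up yearly table so that each year becomes a column, then divides the two columns. The districts "A" and "B" and their prices are invented, and this is only one of several possible approaches.

```python
import pandas as pd

# Invented yearly means for two hypothetical districts.
yearly = pd.DataFrame({
    "London_Districts": ["A", "A", "B", "B"],
    "Year": [1998, 2018, 1998, 2018],
    "Average_price": [100000.0, 500000.0, 80000.0, 320000.0],
})

# One row per district, one column per year.
by_year = yearly.pivot(index="London_Districts", columns="Year", values="Average_price")

# Column-wise division gives the 2018/1998 ratio for every district at once.
ratios = (by_year[2018] / by_year[1998]).sort_values(ascending=False)
print(ratios)
```

A ratio of 5.0 means the 2018 average was five times the 1998 average.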

***Hint***: This section should test the skills you acquired in:
- Python Data Science Toolbox - Part One, all modules

In [None]:
# Create a function to calculate the ratio of 2018 house prices to 1998 house prices
# The function takes the rows of df_yearly_price for a single district as its argument

def create_price_ratio(arg):
    # Select the Average_price value from the 1998 row and take it as a single float
    avg_price_1998 = float(arg.loc[arg['Year'] == 1998, 'Average_price'].iloc[0])
    # Do the same for the 2018 row
    avg_price_2018 = float(arg.loc[arg['Year'] == 2018, 'Average_price'].iloc[0])
    ratio = avg_price_2018/avg_price_1998
    return ratio


In [None]:
# Let's test the function passing on the following argument

create_price_ratio(df_yearly_price[df_yearly_price['London_Districts']=='Barking & Dagenham'])

In [None]:
# We want to do the above with all the London Districts, that is calculate the ratio 2018/1998 for each unique district
# Create an empty dict 

Ratio_per_District = {}

In [None]:
# A for loop that will iterate through each of the unique districts of the 'London_Districts' column of the DataFrame df_yearly_price
# .unique() returns unique values in the df

for name in df_yearly_price['London_Districts'].unique():
    
    district = df_yearly_price[df_yearly_price['London_Districts'] == name]
    Ratio_per_District[name] = create_price_ratio(district)
    
print(Ratio_per_District) 

In [None]:
# Make a variable called df_ratios, and assign it the result of calling the DataFrame constructor on the dictionary Ratio_per_District. 
df_ratios = pd.DataFrame(Ratio_per_District, index=[0])

In [None]:
# check the data
df_ratios.head()

In [None]:
# Now we transpose the df_ratios, and reset the index! 
df_ratios_T = df_ratios.T
df_ratios = df_ratios_T.reset_index()
df_ratios.head()

In [None]:
# Rename the 'index' column as 'London_Districts', and the '0' column to '2018'
df_ratios.rename(columns={'index':'London_Districts', 0:'2018'}, inplace=True)
df_ratios.head()

In [None]:
# Sort in descending order and select the top 10 districts
# Make a variable called top10, assign it the result of calling sort_values() on df_ratios. 
top10 = df_ratios.sort_values(by='2018',ascending=False).head(10)
print(top10)


In [None]:
# Finally, it's time to plot the districts that have seen the greatest changes in price
# Make a variable called bar_graph and assign it the result of selecting the 'London_Districts' and '2018' columns of top10,
# then calling plot(), with the parameter kind = 'bar'

bar_graph = top10[['London_Districts','2018']].plot(kind='bar')

bar_graph.set_xticklabels(top10.London_Districts)

### 4. Conclusion

Average house prices in London districts rose at least fourfold between 1998 and 2018. In Hackney, the district with the largest increase, the average house price in 2018 was 6.2 times the 1998 price. Across the 10 districts with the largest increases, buying a house in 2018 meant paying on average about 5 times what that same house cost 20 years earlier.