# Capstone Project Part 2: Open Challenge

For part 2 of your Capstone Project assignment, I want you to submit your own Jupyter Notebook written from scratch. I also want you to select your own data source **and** *your own questions* to ask about the data you have selected.

This sounds difficult -- and it is. But the point is to give you experience in exploring data yourself, and to show that a big part of data science lies in asking questions and exploring on your own. Who knows, you might find something interesting and valuable enough that this time next year you could be CEO of your own multimillion-pound start-up!

Think back to exercise 8 (London 2012 Olympics data) and the kinds of questions I set for you in that challenge. This time, however, I want you to demonstrate as much of what you have learned in this course as possible. In particular, I want you to create a Jupyter Notebook that demonstrates the following:
 - Gathering data from a data source. You could do this programmatically (e.g. with a Python library that queries an API, such as `tweepy`), or simply download the data from somewhere. If the latter, please add some text describing where you got the data from and why you thought it might be interesting.
 - Data formatting and cleaning. If your data is semi-structured and not already in a CSV, it would be great to see how you mapped it across using some string formatting. I would also like to see examples of data cleaning, such as removing spurious values or dealing with missing values.
 - Using `DataFrame`s - intermediate ones, processed ones, etc. By now you should know that the `DataFrame` is an essential tool!
 - Visualizations. We already know we can visualize directly from `DataFrame`s but it would also be great to see if you could utilize `bokeh` to create other charts.
 - Classification using `scikit-learn` or Natural Language Processing using `nltk`. After the Machine Learning lecture, if you want to try out some classification or NLP, that would be great to see.
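
As a rough illustration of the data-cleaning bullet, the sketch below handles one spurious value and one missing value in a small made-up table. The column names, the `-999` sentinel and the rule that negative values are impossible are all assumptions for the example, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Made-up measurements: one missing value and one obviously spurious value.
raw = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004],
    "value": [10.5, np.nan, -999.0, 12.0],
})

# Treat the sentinel -999 as missing (assumption: negative values cannot occur).
cleaned = raw.copy()
cleaned.loc[cleaned["value"] < 0, "value"] = np.nan

# Option 1: drop the incomplete rows.
dropped = cleaned.dropna(subset=["value"])

# Option 2: fill the gaps by linear interpolation between neighbouring years.
filled = cleaned.assign(value=cleaned["value"].interpolate())
```

Which option is appropriate depends on the question being asked; dropping rows is safer when the gaps are not evenly spread.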


Let's get started by loading some libraries.

In [1]:
import pandas as pd
import numpy as np
from bokeh.io import show, output_notebook
from bokeh.plotting import figure

output_notebook()

## Loading the datasets

Here is an initial list of questions:
- Which were the main crops between 1990 and 2005?
    - Which crops are growing fastest?
    - Prediction for 2030, 2050? (interact?)
- GHG emissions by
    - area: which continent produces the most?
    - global
        - per use (crops & livestock, deforestation, degraded peatland, fire): find the net GHG emissions due to land-use change and deforestation
- Are there other sources of GHG?
- Which livestock produce the most GHG?



In [2]:
crops = pd.read_csv('kaggle_global-food-agriculture-statistics/fao_data_crops_data.csv')
emissions = pd.read_csv('kaggle_global-food-agriculture-statistics/Environment_Emissions_intensities_E_All_Data.csv')

In [3]:
crops.head(10)

Unnamed: 0,country_or_area,element_code,element,year,unit,value,value_footnotes,category
0,Americas +,31,Area Harvested,2007.0,Ha,49404.0,A,agave_fibres_nes
1,Americas +,31,Area Harvested,2006.0,Ha,49404.0,A,agave_fibres_nes
2,Americas +,31,Area Harvested,2005.0,Ha,49404.0,A,agave_fibres_nes
3,Americas +,31,Area Harvested,2004.0,Ha,49113.0,A,agave_fibres_nes
4,Americas +,31,Area Harvested,2003.0,Ha,48559.0,A,agave_fibres_nes
5,Americas +,31,Area Harvested,2002.0,Ha,48506.0,A,agave_fibres_nes
6,Americas +,31,Area Harvested,2001.0,Ha,47767.0,A,agave_fibres_nes
7,Americas +,31,Area Harvested,2000.0,Ha,48747.0,A,agave_fibres_nes
8,Americas +,31,Area Harvested,1999.0,Ha,46978.0,A,agave_fibres_nes
9,Americas +,31,Area Harvested,1998.0,Ha,48571.0,A,agave_fibres_nes


In [4]:
crops.dtypes

country_or_area     object
element_code        object
element             object
year               float64
unit                object
value              float64
value_footnotes     object
category            object
dtype: object

In [5]:
col_list = crops.country_or_area.unique()
col_list
areas = ['Americas +', 
         'Asia +', 
         'Africa +',
         'Caribbean +', 
         'Central America +',
         'Low Income Food Deficit Countries +',
         'Net Food Importing Developing Countries +',
         'Small Island Developing States +',
         'South America +',
         'South-Eastern Asia +',
         'World +',
         'Australia and New Zealand +',
         'Oceania +',
         'Central Asia +',
         'Eastern Asia +',
         'Eastern Europe +',
         'Europe +',
         'European Union +',
         'Least Developed Countries +',
         'LandLocked developing countries +',
         'Northern Africa +', 
         'Northern America +',
         'Southern Africa +',
         'Southern Asia +',
         'Southern Europe +',
         'Western Africa +',
         'Western Asia +',
         'Western Europe +',
         'Eastern Africa +',
         'Northern Europe +',
         'Middle Africa +',
         'Micronesia +',
         'Polynesia +', 
         'Melanesia +'
        ]
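
The hand-written list could probably be derived from the data instead. The sketch below assumes that every aggregate region, and no individual country, carries the trailing ` +` marker, which would need checking against the real `country_or_area` column. The labels here are toy values.

```python
import pandas as pd

# Toy labels; in the notebook this would be crops.country_or_area.
labels = pd.Series(["World +", "Europe +", "France", "Asia +", "Japan", "Europe +"])

# Assumption: aggregate regions, and only those, end with ' +'.
is_region = labels.str.endswith(" +")
areas_auto = sorted(labels[is_region].unique())
```

Comparing `areas_auto` with the manual list is a quick way to spot regions that were missed or typed twice.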

In [6]:
crops_regions = crops[crops.country_or_area.isin(areas)]
crops_regions = crops_regions.rename(columns={'country_or_area' : 'area'})

In [7]:
def fix_country_name(country):
    """Remove the trailing ' +' from a region label."""
    return country.strip(' +')
crops_regions.area = crops_regions.area.apply(fix_country_name)
crops_regions.year = crops_regions.year.astype(int)
crops_regions.area = crops_regions.area.astype(str)
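
The `float64` dtype of `year` hints that some rows may have no year, and `astype(int)` raises an error on missing values. The sketch below, on made-up rows, shows two ways to guard the conversion; whether the real table contains such rows is an assumption to verify.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"area": ["World +", "Asia +", "Europe +"],
                    "year": [2005.0, np.nan, 1999.0]})

# Count the missing years before converting.
n_missing = int(toy["year"].isna().sum())

# Option 1: drop the incomplete rows, then convert to a plain int.
strict = toy.dropna(subset=["year"]).copy()
strict["year"] = strict["year"].astype(int)

# Option 2: keep every row and use pandas' nullable integer dtype.
nullable = toy["year"].astype("Int64")
```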

In [8]:
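# World totals of production quantity, summed over all crop categories per year
crops_world = crops_regions[(crops_regions['area'] == 'World') & (crops_regions['element'] == 'Production Quantity')]
crops_world_y = crops_world.groupby('year')[['value']].sum()  # sum only the numeric 'value' column
crops_world_y.value = crops_world_y.value / 1e6  # tonnes -> million tonnes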

In [9]:
p = figure(plot_height=400, title="World crop production (Mtonnes) since 1960", tools = 'hover')
p.line(crops_world_y.index, crops_world_y.value, line_width=0.9)
p.xaxis.axis_label = 'Year'
p.yaxis.axis_label = 'Total production'
show(p)

In [10]:
# Total production quantity per region in 2005
crops_area2005 = crops_regions[(crops_regions['year'] == 2005) & (crops_regions['element'] == 'Production Quantity')]
crops_harvest_2005 = crops_area2005.groupby('area')['value'].sum()

In [11]:
c_list = crops_harvest_2005.index.tolist()
p = figure(x_range=c_list, plot_height=500, title="Production in 2005 per region of the World")
p.vbar(x=c_list, top=crops_harvest_2005.values, width=0.9)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1

show(p)

In [12]:
# LE's work on emissions: build a table of world emission intensities

emissions_regions = emissions[emissions.Area.isin(crops_regions['area'])]
emissions_int_world = emissions_regions[(emissions_regions.Element == 'Emissions intensity') & (emissions_regions.Area == 'World')]
#emissions_int_world.info() # Check whether the subset contains any empty cells

# Collect the columns whose names end in 'F' (the per-year flag columns)
to_drop = [kk for kk in emissions_int_world.columns if kk.endswith('F')]

emissions_int_world = emissions_int_world.drop(to_drop, axis=1) # Remove the flag columns
emissions_int_world.columns = emissions_int_world.columns.str.replace('Y', '') # Remove the 'Y' prefix from the year columns

emissions_int_world


Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,1961,1962,1963,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
6138,5000,World,1718,Cereals excluding rice,71761,Emissions intensity,kg CO2eq/kg product,0.1618,0.163,0.1719,...,0.2285,0.2135,0.2168,0.226,0.225,0.2306,0.2161,0.2141,0.2145,0.2106
6141,5000,World,27,"Rice, paddy",71761,Emissions intensity,kg CO2eq/kg product,1.8606,1.8387,1.6947,...,0.9818,0.9638,0.9697,0.9664,0.9393,0.9313,0.9364,0.9308,0.9235,0.9195
6144,5000,World,867,"Meat, cattle",71761,Emissions intensity,kg CO2eq/kg product,37.588,36.5186,35.1963,...,25.628,25.7614,25.5895,25.4802,25.5779,25.4334,25.0352,24.9822,25.3661,25.3965
6147,5000,World,882,"Milk, whole fresh cow",71761,Emissions intensity,kg CO2eq/kg product,1.6273,1.6105,1.6272,...,0.9669,0.9546,0.9676,0.9551,0.9412,0.9304,0.926,0.9052,0.8873,0.9002
6150,5000,World,1017,"Meat, goat",71761,Emissions intensity,kg CO2eq/kg product,53.4359,55.2056,56.6251,...,31.3431,31.4766,31.1532,30.2749,30.3772,30.0247,30.4036,30.2669,30.0147,30.5199
6153,5000,World,1020,"Milk, whole fresh goat",71761,Emissions intensity,kg CO2eq/kg product,2.2959,2.3469,2.3051,...,2.4762,2.4728,2.5167,2.5265,2.5499,2.6059,2.5529,2.5005,2.5982,2.8596
6156,5000,World,947,"Meat, buffalo",71761,Emissions intensity,kg CO2eq/kg product,103.2366,101.1451,100.0774,...,61.359,61.1803,60.232,57.5012,57.1905,56.8239,56.0871,56.0798,55.3591,55.4973
6159,5000,World,951,"Milk, whole fresh buffalo",71761,Emissions intensity,kg CO2eq/kg product,1.6897,1.6719,1.6584,...,1.0147,1.0039,0.9948,0.9791,0.9653,0.9516,0.9324,0.8986,0.9089,0.9067
6162,5000,World,977,"Meat, sheep",71761,Emissions intensity,kg CO2eq/kg product,43.8651,43.252,43.3207,...,22.6492,22.6997,21.9098,21.9202,22.1005,22.2855,22.2221,21.5015,21.4014,21.4285
6165,5000,World,982,"Milk, whole fresh sheep",71761,Emissions intensity,kg CO2eq/kg product,5.1939,5.2,5.0408,...,4.7105,4.7637,4.7424,4.834,5.0079,4.8842,4.7252,4.7592,4.7583,4.964
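
The column clean-up above relies on no other column name ending in `F` or containing a `Y`. A slightly stricter variant is sketched below; it assumes the raw layout is `Y<year>` for values and `Y<year>F` for flags, and the toy frame and its flag values are made up.

```python
import pandas as pd

# Toy frame mimicking the assumed raw layout.
raw = pd.DataFrame({
    "Item": ["Rice, paddy"],
    "Unit": ["kg CO2eq/kg product"],
    "Y1961": [1.8606], "Y1961F": ["A"],
    "Y1962": [1.8387], "Y1962F": ["A"],
})

# Drop only the columns that look exactly like Y<4 digits>F.
flag_cols = raw.columns[raw.columns.str.match(r"Y\d{4}F$")]
tidy = raw.drop(columns=flag_cols)

# Strip the leading 'Y' only from year columns, leaving other names untouched.
tidy.columns = tidy.columns.str.replace(r"^Y(\d{4})$", r"\1", regex=True)
```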


In [13]:
# Pivot the table to plot it

d_tot_emission = emissions_int_world.drop(['Area', 'Area Code','Element Code','Item Code'],axis=1)
d_tot_emission = d_tot_emission.pivot_table(index=None, columns='Item')

In [14]:
p = figure(x_range=d_tot_emission.columns.tolist(), plot_height=500, title="Total of the GHG emissions (kg CO2eq/kg product) for 1961 - 2016")
p.vbar(x=d_tot_emission.columns.tolist(), top=d_tot_emission.sum(), width=0.9)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1

show(p)

In [15]:
# Improve the previous plot: color the bars by decade.
# I found it easier to keep one year per decade than to sum over each decade.

d_tot_emission.reset_index(inplace=True) # turn the year index into a column
d_tot_emission = d_tot_emission.rename(columns={'index' : 'years'}) # rename the column
d_tot_emission.years = pd.to_numeric(d_tot_emission.years, downcast='integer') # turn years into int so we can filter them
d_tot_emission


Item,years,Cereals excluding rice,"Eggs, hen, in shell","Meat, buffalo","Meat, cattle","Meat, chicken","Meat, goat","Meat, pig","Meat, sheep","Milk, whole fresh buffalo","Milk, whole fresh camel","Milk, whole fresh cow","Milk, whole fresh goat","Milk, whole fresh sheep","Rice, paddy"
0,1961,0.1618,1.1505,103.2366,37.588,1.0073,53.4359,3.4908,43.8651,1.6897,3.8394,1.6273,2.2959,5.1939,1.8606
1,1962,0.163,1.1476,101.1451,36.5186,1.0115,55.2056,3.4938,43.252,1.6719,3.8313,1.6105,2.3469,5.2,1.8387
2,1963,0.1719,1.1287,100.0774,35.1963,1.009,56.6251,3.3577,43.3207,1.6584,3.7916,1.6272,2.3051,5.0408,1.6947
3,1964,0.1759,1.1012,101.5181,35.4845,0.9839,55.3735,3.2808,44.027,1.6496,3.7753,1.5882,2.3267,5.0146,1.6617
4,1965,0.1871,1.1035,100.6839,35.8378,0.929,53.5321,3.1694,44.4964,1.6492,3.759,1.5194,2.381,4.8876,1.7231
5,1966,0.188,1.1077,105.4525,34.7524,0.8909,52.3708,3.0807,44.0231,1.6468,3.7764,1.5143,2.3609,4.844,1.7108
6,1967,0.1926,1.1032,106.5046,34.0016,0.8923,51.9045,3.1317,43.5618,1.6895,3.724,1.474,2.4182,5.0444,1.6304
7,1968,0.1971,1.0998,106.8264,32.978,0.8912,51.8701,3.1049,42.8697,1.6632,3.7097,1.4525,2.4835,5.2116,1.5991
8,1969,0.2045,1.0846,107.1353,32.2959,0.8541,50.696,3.1155,43.7275,1.6549,3.7154,1.4494,2.5134,5.1329,1.5837
9,1970,0.2162,1.0904,105.4367,32.4799,0.8133,49.895,3.0368,41.732,1.7136,3.4356,1.4237,2.5372,5.1478,1.5094


In [16]:
decades = np.arange(1961, 2017, 10).tolist()
decades

tot_dec_emission = d_tot_emission[d_tot_emission['years'].isin(decades)]
tot_dec_emission

Item,years,Cereals excluding rice,"Eggs, hen, in shell","Meat, buffalo","Meat, cattle","Meat, chicken","Meat, goat","Meat, pig","Meat, sheep","Milk, whole fresh buffalo","Milk, whole fresh camel","Milk, whole fresh cow","Milk, whole fresh goat","Milk, whole fresh sheep","Rice, paddy"
0,1961,0.1618,1.1505,103.2366,37.588,1.0073,53.4359,3.4908,43.8651,1.6897,3.8394,1.6273,2.2959,5.1939,1.8606
10,1971,0.2053,1.0897,103.2246,33.3138,0.8079,49.2669,3.0872,41.2118,1.7338,3.425,1.4212,2.5689,5.1157,1.5302
20,1981,0.2414,0.8777,90.8772,31.2939,0.7553,46.3433,2.7672,39.5268,1.5547,3.2587,1.3231,2.5674,4.6501,1.3407
30,1991,0.2413,0.7639,74.2182,28.437,0.7722,36.784,2.2297,34.3866,1.4486,3.3682,1.1854,2.7528,4.6789,1.1232
40,2001,0.2277,0.6644,65.4022,27.6655,0.6543,34.3226,1.7479,24.1582,1.0854,3.0666,1.042,2.5827,4.8814,1.025
50,2011,0.225,0.6773,57.1905,25.5779,0.5983,30.3772,1.5846,22.1005,0.9653,2.6224,0.9412,2.5499,5.0079,0.9393
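
If a per-decade aggregate is wanted instead of one sample year per decade, a `groupby` on a derived decade label is one option. This is a sketch on made-up values; note that it bins 1960 to 1969 together, whereas the table above samples 1961, 1971, and so on.

```python
import pandas as pd

toy = pd.DataFrame({
    "years": [1961, 1965, 1969, 1971, 1975],
    "Meat, cattle": [37.0, 35.0, 33.0, 33.0, 31.0],
})

# Label each row with its decade, e.g. 1965 -> 1960.
decade = (toy["years"] // 10) * 10

# Average every product column within each decade.
decade_mean = toy.drop(columns="years").groupby(decade).mean()
```

For an intensity such as kg CO2eq per kg of product, a mean is usually easier to interpret than a sum over years.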


In [18]:
from bokeh.palettes import Spectral
from bokeh.models import ColumnDataSource

years = tot_dec_emission['years'].tolist()
product = tot_dec_emission.columns.tolist()[1:]  # [1:] skips the first column ('years')

source = ColumnDataSource(data=tot_dec_emission)


p = figure(x_range=product, # food categories for x range
           plot_height=500,
           title="Total of the GHG emissions (kg CO2eq/kg product) for 1961 - 2016")

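# NOTE: this call fails with the ValueError shown below. vbar_stack expects
# the stackers to be column names (strings) in `source`, but `years` holds
# integers; 'Item' is not a column of `source` either, and `Spectral` is a
# dict of palettes rather than a list of colors.
p.vbar_stack(years, 
             x='Item',
             width=0.9, 
             color=Spectral, 
             source=source)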
             
# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1

show(p)

ValueError: expected an element of Seq(String), got seq with invalid items [1961]

In [30]:
from bokeh.palettes import brewer
from bokeh.models import ColumnDataSource

TOOLS = "save,pan,box_zoom,reset,wheel_zoom,tap"
#years = tot_dec_emission['years'].tolist()
years = ['1961', '1971', '1981', '1991', '2001', '2011']
product = tot_dec_emission.columns.tolist()[1:]  # [1:] skips the first column ('years')
source = ColumnDataSource(data=tot_dec_emission)

p = figure(plot_width=800, title="",toolbar_location='above', tools=TOOLS)
colors = brewer['RdYlBu'][6]

p.vbar_stack(years, x='Item', width=0.9, color=colors, source=source)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None

show(p)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: Item [renderer: GlyphRenderer(id='1692', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: Item [renderer: GlyphRenderer(id='1697', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: Item [renderer: GlyphRenderer(id='1702', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: Item [renderer: GlyphRenderer(id='1707', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: Item [renderer: GlyphRenderer(id='1712', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: Item [renderer: GlyphRenderer(id='1717', ...)]


In [28]:
years = tot_dec_emission['years'].tolist()
years

[1961, 1971, 1981, 1991, 2001, 2011]
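
Both errors above seem to come from the shape of the table: `vbar_stack` wants one row per bar (here, one per product) and one string-named column per stacked layer (here, one per decade), whereas `tot_dec_emission` has the years as rows. A possible fix is to transpose first. The sketch below uses two decades and two products copied from the table above; the helper name `stacked_chart` and the chart title are hypothetical, and the helper needs bokeh installed when called.

```python
import pandas as pd

# Two decades and two products copied from the table above.
toy = pd.DataFrame({
    "years": [1961, 1971],
    "Meat, cattle": [37.588, 33.3138],
    "Rice, paddy": [1.8606, 1.5302],
})

# Products become rows; the years become string-named columns.
by_item = toy.set_index("years").T
by_item.columns = by_item.columns.astype(str)
by_item.index.name = "Item"

year_cols = by_item.columns.tolist()
data = by_item.reset_index().to_dict(orient="list")


def stacked_chart(data, year_cols):
    """Build a stacked bar chart from the reshaped data (hypothetical helper)."""
    from bokeh.models import ColumnDataSource
    from bokeh.palettes import brewer
    from bokeh.plotting import figure

    # Brewer palettes come in sizes 3 to 11, so slice when fewer layers are needed.
    colors = list(brewer["RdYlBu"][max(len(year_cols), 3)][:len(year_cols)])
    p = figure(x_range=data["Item"], title="Emission intensity by decade")
    p.vbar_stack(year_cols, x="Item", width=0.9, color=colors,
                 source=ColumnDataSource(data))
    return p
```

In the notebook, `show(stacked_chart(data, year_cols))` would then render the figure, with `data` built from the full `tot_dec_emission` table in the same way.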