# Capstone Project Part 2: Open Challenge

For part 2 of your Capstone Project assignment, I want you to submit your own Jupyter Notebook written from scratch. I also want you to select your own data source **and** *your own questions* to ask about the data you have selected.

This sounds difficult -- and it is. But the point here is to give you experience of exploring data yourself, and to show that a big part of data science is asking questions and exploring on your own. Who knows, you might find something interesting and valuable enough that this time next year you could be CEO of your own multimillion pound start-up!

Think back to exercise 8 (London 2012 Olympics data) and the kinds of questions I set for you in that challenge. This time however, I want you to demonstrate as much of what you have learned in this course as possible. In particular, I want you to create a Jupyter Notebook that demonstrates the following:
 - Gathering data from a data source. You could do this programmatically (e.g. with a Python library such as `tweepy` that queries an API), or simply download the data from somewhere. If the latter, please add some text describing where you got the data from and why you thought it might be interesting.
 - Data formatting and cleaning. If your data is semi-structured and not already in a CSV, it would be great to see how you mapped it across using some string formatting. Also include examples of data cleaning, such as removing spurious values or dealing with missing values.
 - Using `DataFrame`s - intermediate ones, processed ones, etc. By now you should know that the `DataFrame` is an essential tool!
 - Visualizations. We already know we can visualize directly from `DataFrame`s but it would also be great to see if you could utilize `bokeh` to create other charts.
 - Classification using `scikit-learn` or Natural Language Processing using `nltk`. After the Machine Learning lecture, if you want to try out some classification or NLP, that would be great to see.
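
As a rough illustration of the data cleaning step listed above, here is a minimal sketch on a made-up table. The column names, the values, and the decision to treat negative numbers as spurious are all assumptions for the example, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing value and one obviously spurious (negative) value
raw = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'production': [120.0, np.nan, -5.0, 80.0],
})

# Treat negative production as spurious and mark it as missing
clean = raw.copy()
clean.loc[clean['production'] < 0, 'production'] = np.nan

# Drop the rows that have no usable value
clean = clean.dropna(subset=['production']).reset_index(drop=True)
print(clean)
```

Whether to drop, flag, or fill such rows depends on the question being asked, so it is worth explaining the choice in the notebook text.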


Let's get started by loading some libraries.

In [96]:
import pandas as pd
import numpy as np
from bokeh.io import show, output_notebook
from bokeh.plotting import figure

output_notebook()

## Loading the datasets

Here is an initial list of questions:
- Main crops grown between 1990 and 2005
    - Fastest-growing crops?
    - Predictions for 2030 and 2050? - interact?


- GHG emissions by
    - area: which continent produces more?
    - global
        - per use (crops & livestock, deforestation, degraded peatland, fire): find net GHG emissions due to land use change and deforestation r

- Other sources of GHG?
- Which livestock produce more GHG?



In [152]:
crops = pd.read_csv('kaggle_global-food-agriculture-statistics/fao_data_crops_data.csv')
emissions = pd.read_csv('kaggle_global-food-agriculture-statistics/Environment_Emissions_intensities_E_All_Data.csv')

In [98]:
crops.head(10)

Unnamed: 0,country_or_area,element_code,element,year,unit,value,value_footnotes,category
0,Americas +,31,Area Harvested,2007.0,Ha,49404.0,A,agave_fibres_nes
1,Americas +,31,Area Harvested,2006.0,Ha,49404.0,A,agave_fibres_nes
2,Americas +,31,Area Harvested,2005.0,Ha,49404.0,A,agave_fibres_nes
3,Americas +,31,Area Harvested,2004.0,Ha,49113.0,A,agave_fibres_nes
4,Americas +,31,Area Harvested,2003.0,Ha,48559.0,A,agave_fibres_nes
5,Americas +,31,Area Harvested,2002.0,Ha,48506.0,A,agave_fibres_nes
6,Americas +,31,Area Harvested,2001.0,Ha,47767.0,A,agave_fibres_nes
7,Americas +,31,Area Harvested,2000.0,Ha,48747.0,A,agave_fibres_nes
8,Americas +,31,Area Harvested,1999.0,Ha,46978.0,A,agave_fibres_nes
9,Americas +,31,Area Harvested,1998.0,Ha,48571.0,A,agave_fibres_nes


In [99]:
crops.dtypes

country_or_area     object
element_code        object
element             object
year               float64
unit                object
value              float64
value_footnotes     object
category            object
dtype: object

In [100]:
col_list = crops.country_or_area.unique()
col_list
areas = ['Americas +', 
         'Asia +', 
         'Africa +',
         'Caribbean +', 
         'Central America +',
         'Low Income Food Deficit Countries +',
         'Net Food Importing Developing Countries +',
         'Small Island Developing States +',
         'South America +',
         'South-Eastern Asia +',
         'World +',
         'Australia and New Zealand +',
         'Oceania +',
         'Central Asia +',
         'Eastern Asia +',
         'Eastern Europe +',
         'Europe +',
         'European Union +',
         'Least Developed Countries +',
         'LandLocked developing countries +',
         'Northern Africa +', 
         'Northern America +',
         'Southern Africa +',
         'Southern Asia +',
         'Southern Europe +',
         'Western Africa +',
         'Western Asia +',
         'Western Europe +',
         'Eastern Africa +',
         'Northern Europe +',
         'Middle Africa +',
         'Micronesia +',
         'Polynesia +', 
         'Melanesia +'
        ]

In [101]:
crops_regions = crops[crops.country_or_area.isin(areas)]
crops_regions = crops_regions.rename(columns={'country_or_area' : 'area'})

In [103]:
def fix_country_name(country):
    """Remove the trailing ' +' from a region label."""
    return country.strip(' +')


crops_regions['area'] = crops_regions['area'].apply(fix_country_name)
crops_regions['year'] = crops_regions['year'].astype(int)
crops_regions['area'] = crops_regions['area'].astype(str)
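
To make the label clean-up above concrete, here is a small sketch on three of the region labels from the `areas` list. It uses the vectorised `Series.str.strip`, which is one possible alternative to `apply`; the `labels` and `stripped` names exist only in this example.

```python
import pandas as pd

# Three labels taken from the `areas` list
labels = pd.Series(['Americas +', 'World +', 'South-Eastern Asia +'])

# strip(' +') removes any run of spaces and plus signs from both ends of each string
stripped = labels.str.strip(' +')
print(stripped.tolist())
```

Note that `strip` takes a set of characters rather than a literal suffix, so it would also remove a leading `+` or space if a label had one.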

In [104]:
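# World-level rows for production quantity only
crops_world = crops_regions[(crops_regions['area'] == 'World') & (crops_regions['element'] == 'Production Quantity')]
# Sum only the numeric 'value' column per year, then express it in millions
crops_world_y = crops_world.groupby('year')[['value']].sum()
crops_world_y['value'] = crops_world_y['value'] / 1e6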

In [105]:
p = figure(plot_height=400, title="World crop production (Mtonnes) since 1960", tools = 'hover')
p.line(crops_world_y.index, crops_world_y.value, line_width=0.9)
p.xaxis.axis_label = 'Year'
p.yaxis.axis_label = 'Total production'
show(p)

In [197]:
emissions_regions = emissions[emissions.Area.isin(crops_regions['area'])]
emissions_int_world = emissions_regions[(emissions_regions.Element == 'Emissions intensity') & (emissions_regions.Area == 'World')]
# emissions_int_world.info()  # Check whether the subset contains any empty cells
# Drop every column whose name ends in 'F' (the Unit column is kept)
to_drop = [col for col in emissions_int_world.columns if col.endswith('F')]
emissions_int_world = emissions_int_world.drop(to_drop, axis=1)
# Remove the 'Y' prefix from the year columns
emissions_int_world.columns = emissions_int_world.columns.str.replace('Y', '')

emissions_int_world


Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Unit,1961,1962,1963,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
6138,5000,World,1718,Cereals excluding rice,71761,Emissions intensity,kg CO2eq/kg product,0.1618,0.163,0.1719,...,0.2285,0.2135,0.2168,0.226,0.225,0.2306,0.2161,0.2141,0.2145,0.2106
6141,5000,World,27,"Rice, paddy",71761,Emissions intensity,kg CO2eq/kg product,1.8606,1.8387,1.6947,...,0.9818,0.9638,0.9697,0.9664,0.9393,0.9313,0.9364,0.9308,0.9235,0.9195
6144,5000,World,867,"Meat, cattle",71761,Emissions intensity,kg CO2eq/kg product,37.588,36.5186,35.1963,...,25.628,25.7614,25.5895,25.4802,25.5779,25.4334,25.0352,24.9822,25.3661,25.3965
6147,5000,World,882,"Milk, whole fresh cow",71761,Emissions intensity,kg CO2eq/kg product,1.6273,1.6105,1.6272,...,0.9669,0.9546,0.9676,0.9551,0.9412,0.9304,0.926,0.9052,0.8873,0.9002
6150,5000,World,1017,"Meat, goat",71761,Emissions intensity,kg CO2eq/kg product,53.4359,55.2056,56.6251,...,31.3431,31.4766,31.1532,30.2749,30.3772,30.0247,30.4036,30.2669,30.0147,30.5199
6153,5000,World,1020,"Milk, whole fresh goat",71761,Emissions intensity,kg CO2eq/kg product,2.2959,2.3469,2.3051,...,2.4762,2.4728,2.5167,2.5265,2.5499,2.6059,2.5529,2.5005,2.5982,2.8596
6156,5000,World,947,"Meat, buffalo",71761,Emissions intensity,kg CO2eq/kg product,103.2366,101.1451,100.0774,...,61.359,61.1803,60.232,57.5012,57.1905,56.8239,56.0871,56.0798,55.3591,55.4973
6159,5000,World,951,"Milk, whole fresh buffalo",71761,Emissions intensity,kg CO2eq/kg product,1.6897,1.6719,1.6584,...,1.0147,1.0039,0.9948,0.9791,0.9653,0.9516,0.9324,0.8986,0.9089,0.9067
6162,5000,World,977,"Meat, sheep",71761,Emissions intensity,kg CO2eq/kg product,43.8651,43.252,43.3207,...,22.6492,22.6997,21.9098,21.9202,22.1005,22.2855,22.2221,21.5015,21.4014,21.4285
6165,5000,World,982,"Milk, whole fresh sheep",71761,Emissions intensity,kg CO2eq/kg product,5.1939,5.2,5.0408,...,4.7105,4.7637,4.7424,4.834,5.0079,4.8842,4.7252,4.7592,4.7583,4.964
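
The table above is in wide form, with one column per year. For plotting a time series per item it can help to reshape it to long form. The sketch below does that with `DataFrame.melt` on a two-row, two-year excerpt of the values shown above; the names `wide`, `long_form` and `intensity` are chosen only for this example.

```python
import pandas as pd

# Two rows and two year columns copied from the World table above
wide = pd.DataFrame({
    'Item': ['Rice, paddy', 'Meat, cattle'],
    '1961': [1.8606, 37.588],
    '1962': [1.8387, 36.5186],
})

# One row per (Item, year) pair
long_form = wide.melt(id_vars='Item', var_name='year', value_name='intensity')
long_form['year'] = long_form['year'].astype(int)
print(long_form)
```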


In [122]:
# Total production quantity per region in 2005
crops_area2005 = crops_regions[(crops_regions['year'] == 2005) & (crops_regions['element'] == 'Production Quantity')]
crops_harvest_2005 = crops_area2005.groupby('area')['value'].sum()
crops_harvest_2005.values

array([1.38663434e+09, 3.94518078e+09, 6.85474058e+09, 2.05937419e+08,
       5.82045710e+07, 3.28877789e+08, 1.27815700e+08, 3.32803119e+08,
       3.18331486e+09, 1.08696834e+09, 2.40723481e+09, 1.62387964e+09,
       5.52988254e+08, 8.80808537e+08, 6.94455986e+09, 1.58568740e+07,
       4.04041000e+05, 1.20387440e+08, 1.52416858e+09, 2.88787832e+08,
       2.05563303e+09, 2.23995260e+08, 2.22991879e+08, 7.93540000e+05,
       8.92178680e+07, 1.50246538e+09, 1.23387008e+09, 1.08128675e+08,
       1.93678563e+09, 4.50048816e+08, 5.36527272e+08, 3.72954303e+08,
       6.46222388e+08, 1.48167824e+10])
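
The regions in `areas` overlap: `World` contains every other region, and a continent such as `Asia` contains its sub-regions. The totals above therefore cannot be added together. One possible way to make a bar chart easier to read is to leave out the global aggregate and sort the rest, as sketched below. The numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up totals, for illustration only
totals = pd.Series({'World': 100.0, 'Asia': 45.0, 'Southern Asia': 20.0, 'Europe': 15.0})

# Exclude the global aggregate and order the bars from largest to smallest
regional = totals.drop('World').sort_values(ascending=False)
print(regional)
```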

In [232]:
c_list = crops_harvest_2005.index.tolist()
p = figure(x_range=c_list, plot_height=500, title="Production in 2005 per region of the World")
p.vbar(x=c_list, top=crops_harvest_2005.values, width=0.9)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1

show(p)

In [245]:
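# Keep only the Item label and the year columns, then put items in columns and years in rows
d_tot_emission = emissions_int_world.drop(['Area', 'Area Code', 'Element', 'Element Code', 'Item Code', 'Unit'], axis=1)
d_tot_emission = d_tot_emission.pivot_table(index=None, columns='Item')

# One bar per item: emissions intensity summed over all years
p = figure(x_range=d_tot_emission.columns.tolist(), plot_height=500, title="World emissions intensity summed over all years, by item")
p.vbar(x=d_tot_emission.columns.tolist(), top=d_tot_emission.sum().values, width=0.9)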

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1

show(p)

In [243]:
d_tot_emission.sum()

Item
Cereals excluding rice         12.5286
Eggs, hen, in shell            47.3482
Meat, buffalo                4499.0889
Meat, cattle                 1670.0256
Meat, chicken                  41.9616
Meat, goat                   2269.0493
Meat, pig                     133.1405
Meat, sheep                  1880.5885
Milk, whole fresh buffalo      75.7228
Milk, whole fresh camel       183.6111
Milk, whole fresh cow          67.9288
Milk, whole fresh goat        143.6335
Milk, whole fresh sheep       271.0397
Rice, paddy                    69.9881
dtype: float64
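
Emissions intensity is a ratio (kg CO2eq per kg of product), so the mean over the years may be easier to interpret than the sum, although the ranking of items is the same as long as every item has a value for every year. The sketch below computes the mean on a two-year excerpt of the World values shown earlier; `by_year` and `mean_intensity` are names used only in this example.

```python
import pandas as pd

# One row per year, one column per item (values copied from the World table above)
by_year = pd.DataFrame(
    {'Rice, paddy': [1.8606, 1.8387], 'Meat, cattle': [37.588, 36.5186]},
    index=['1961', '1962'],
)

# Average intensity per item, largest first
mean_intensity = by_year.mean().sort_values(ascending=False)
print(mean_intensity)
```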

In [None]:
# Draft stacked-bar cell, not yet run: `year_list`, `years`, `colors` and `data`
# still need to be defined from the data above before it will work.
from bokeh.core.properties import value

p = figure(x_range=year_list, plot_height=250, title="Fruit Counts by Year",
           toolbar_location=None, tools="hover", tooltips="$name @fruits: @$name")

p.vbar_stack(years, x='year', width=0.9, color=colors, source=data,
             legend=[value(x) for x in years])

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"

show(p)