# Capstone Project Part 2: Open Challenge

For part 2 of your Capstone Project assignment, I want you to submit your own Jupyter Notebook written from scratch. I also want you to select your own data source **and** *your own questions* to ask about the data you have selected.

This sounds difficult -- and it is. But the point here is to give you the experience of exploring data yourself, and to show you that a big part of data science is asking questions and exploring on your own. Who knows, you might find something interesting and valuable enough that this time next year you could be CEO of your own multimillion-pound start-up!

Think back to exercise 8 (London 2012 Olympics data) and the kinds of questions I set for you in that challenge. This time, however, I want you to demonstrate as much of what you have learned in this course as possible. In particular, I want you to create a Jupyter Notebook that demonstrates the following:
 - Gathering data from a data source. You could do this programmatically (e.g. with a Python library such as `tweepy` that queries an API), or just download the data from somewhere. If the latter, please add some text describing where you got the data from and why you thought it might be interesting.
 - Data formatting and cleaning. If your data is semi-structured and not already in a CSV, it would be great to see how you mapped it across using some string formatting. I would also like to see examples of data cleaning, such as removing spurious values or dealing with missing values.
 - Using `DataFrame`s - intermediate ones, processed ones, etc. By now you should know that the `DataFrame` is an essential tool!
 - Visualizations. We already know we can visualize directly from `DataFrame`s but it would also be great to see if you could utilize `bokeh` to create other charts.
 - Classification using `scikit-learn` or Natural Language Processing using `nltk`. After the Machine Learning lecture, if you want to try out some classification or NLP, that would be great to see.
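
As a rough illustration of the cleaning step mentioned in the list above, the sketch below works on a small made-up table. The column names, the values and the `-999` "no data" sentinel are all hypothetical and not taken from any dataset in this course. It treats the sentinel as missing, drops rows without a measurement, and converts the year from text to an integer.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: -999 is assumed to be a "no data" sentinel
raw = pd.DataFrame({
    "year": ["2001", "2002", "2003", "2004"],
    "value": [10.5, -999.0, np.nan, 12.0],
})

clean = raw.copy()
clean["value"] = clean["value"].replace(-999.0, np.nan)  # spurious value -> missing
clean = clean.dropna(subset=["value"])                    # drop rows without a measurement
clean["year"] = clean["year"].astype(int)                 # text year -> integer
clean = clean.reset_index(drop=True)

print(clean)
```

Whether to drop or fill missing values depends on the question being asked, so it is worth stating the choice in the notebook text.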


Let's get started by loading some libraries.

In [None]:
import pandas as pd
import numpy as np
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.palettes import brewer
from bokeh.models import ColumnDataSource, value

output_notebook()

## Loading the datasets

Here is an initial list of questions:
- main crops used between 1990 and 2005
    - fastest-growing crops?
    - prediction for 2030, 2050? - interact?

- GHG emissions by
    - area: which continent produces more?
    - global
        - per use (crops & livestock, deforestation, degraded peatland, fire): find net GHG emissions due to land use change and deforestation

- other sources of GHG?
- which livestock produces more GHG?
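
One possible way to approach the "fastest-growing crops" question is to compare each crop's production in the first and last year of the period. The sketch below uses made-up numbers, and the column names `category`, `year` and `value` are assumptions about how the crop table is laid out, so they may need adjusting to the real file.

```python
import pandas as pd

# Made-up production figures; `category`, `year` and `value` are assumed column names
toy = pd.DataFrame({
    "category": ["maize", "maize", "rice", "rice", "wheat", "wheat"],
    "year": [1990, 2005, 1990, 2005, 1990, 2005],
    "value": [100.0, 150.0, 200.0, 220.0, 80.0, 60.0],
})

# One row per crop, one column per year
by_year = toy.pivot_table(index="category", columns="year", values="value", aggfunc="sum")

# Relative growth between the first and last year, largest first
growth = ((by_year[2005] - by_year[1990]) / by_year[1990]).sort_values(ascending=False)
print(growth)
```

Relative growth favours crops that start from a small base, so it may be worth looking at the absolute change as well.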




In [None]:
crops = pd.read_csv('kaggle_global-food-agriculture-statistics/fao_data_crops_data.csv')
emissions = pd.read_csv('kaggle_global-food-agriculture-statistics/Environment_Emissions_intensities_E_All_Data.csv')

In [None]:
crops.head(10)

In [None]:
crops.dtypes

In [None]:
# Distinct area labels found in the data
col_list = crops.country_or_area.unique()
# Labels of the aggregated regions to keep
areas = ['Americas +', 
         'Asia +', 
         'Africa +',
         'Caribbean +', 
         'Central America +',
         'Low Income Food Deficit Countries +',
         'Net Food Importing Developing Countries +',
         'Small Island Developing States +',
         'South America +',
         'South-Eastern Asia +',
         'World +',
         'Australia and New Zealand +',
         'Oceania +',
         'Central Asia +',
         'Eastern Asia +',
         'Eastern Europe +',
         'Europe +',
         'European Union +',
         'Least Developed Countries +',
         'LandLocked developing countries +',
         'Northern Africa +', 
         'Northern America +',
         'Southern Africa +',
         'Southern Asia +',
         'Southern Europe +',
         'Western Africa +',
         'Western Asia +',
         'Western Europe +',
         'Eastern Africa +',
         'Northern Europe +',
         'Middle Africa +',
         'Micronesia +',
         'Polynesia +', 
         'Melanesia +'
        ]

In [None]:
# Keep only the aggregated regions; take an explicit copy before editing columns later on
crops_regions = crops[crops.country_or_area.isin(areas)].copy()
crops_regions = crops_regions.rename(columns={'country_or_area' : 'area'})

In [None]:
def fix_country_name(country):
    """Strip spaces and '+' from both ends of a region label (e.g. 'World +' -> 'World')."""
    return country.strip(' +')

crops_regions['area'] = crops_regions['area'].apply(fix_country_name).astype(str)
crops_regions['year'] = crops_regions['year'].astype(int)
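
The same label clean-up can also be written with pandas' vectorised string methods instead of applying a Python function to each row. This is only a sketch on three toy labels written in the same style as the region names above.

```python
import pandas as pd

# Toy labels in the same style as the region names above
labels = pd.Series(["World +", "Asia +", "Western Europe +"])

# Remove any run of spaces and plus signs at the end of each label
cleaned = labels.str.rstrip(" +")

print(cleaned.tolist())
```

Note that `rstrip` only touches the end of the string, whereas `strip` removes the same characters from both ends.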

In [None]:
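# World production quantity, summed over all crops for each year
crops_world = crops_regions[(crops_regions['area'] == 'World') & (crops_regions['element'] == 'Production Quantity')]
# Select the numeric column before summing so that text columns are not aggregated
crops_world_y = crops_world.groupby('year')[['value']].sum()
crops_world_y['value'] = crops_world_y['value'] / 1e6  # rescale to match the Mtonnes label in the plot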
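
To make the aggregation step concrete, here is a minimal sketch of a per-year sum on made-up rows. The crop names and values are hypothetical, and the values are assumed to be in tonnes.

```python
import pandas as pd

# Hypothetical rows: two crops in each of two years, values assumed to be in tonnes
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "category": ["maize", "rice", "maize", "rice"],
    "value": [2.0e6, 3.0e6, 2.5e6, 3.5e6],
})

# Sum only the numeric column per year, then rescale to millions
per_year = toy.groupby("year")["value"].sum() / 1e6
print(per_year)
```

Selecting the `value` column before calling `sum()` avoids trying to add up text columns such as the crop name.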

In [None]:
p = figure(plot_height=400, title="World crop production (Mtonnes) since 1960", tools = 'hover')
p.line(crops_world_y.index, crops_world_y.value, line_width=0.9)
p.xaxis.axis_label = 'Year'
p.yaxis.axis_label = 'Total production'
show(p)

In [None]:
# Production quantity per region in 2005
crops_area2005 = crops_regions[(crops_regions['year'] == 2005) & (crops_regions['element'] == 'Production Quantity')]
crops_harvest_2005 = crops_area2005.groupby('area')['value'].sum()

In [None]:
c_list = crops_harvest_2005.index.tolist()
p = figure(x_range=c_list, plot_height=500, title="Production in 2005 per region of the World")
p.vbar(x=c_list, top=crops_harvest_2005.values, width=0.9)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1

show(p)

In [None]:
### LE's work on emissions: create a table with the world emissions
emissions[(emissions.Item == 'Meat, cattle' )& (emissions.Area == 'World')]

In [None]:
emissions_regions = emissions[emissions.Area.isin(crops_regions['area'])]
emissions_int_world = emissions_regions[(emissions_regions.Element == 'Emissions intensity') & (emissions_regions.Area == 'World')]
#emissions_int_world.info() # Check whether the subset contains any empty cells
# Collect the columns whose names end in 'F' (the per-year flag columns)
to_drop = [kk for kk in emissions_int_world.columns if kk.endswith('F')]

emissions_int_world = emissions_int_world.drop(to_drop, axis=1) # Remove the flag columns
emissions_int_world.columns = emissions_int_world.columns.str.replace('Y','') # Remove the 'Y' prefix from the year columns

emissions_int_world
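
The column clean-up above relies on no other column name ending in `F` or containing a capital `Y`. A slightly stricter variant, sketched here on a toy frame, matches the year pattern explicitly. The two numeric values are the world 'Meat, cattle' intensities for 1961 and 1962; the flag values `"A"` are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the wide layout: one value column and one flag column per year
wide = pd.DataFrame({
    "Area": ["World"],
    "Item": ["Meat, cattle"],
    "Y1961": [37.588],
    "Y1961F": ["A"],   # made-up flag value
    "Y1962": [36.5186],
    "Y1962F": ["A"],   # made-up flag value
})

# Drop only the columns that look like 'Y<4 digits>F'
flag_cols = wide.columns[wide.columns.str.match(r"^Y\d{4}F$")]
tidy = wide.drop(columns=flag_cols)

# Strip the leading 'Y' only when it is followed by a four-digit year
tidy.columns = tidy.columns.str.replace(r"^Y(?=\d{4}$)", "", regex=True)
print(tidy.columns.tolist())
```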


In [173]:
# Pivot the table to plot it to have rows of years

d_tot_emission = emissions_int_world.drop(['Area', 'Area Code','Element Code','Item Code'],axis=1)
d_tot_emission = d_tot_emission.pivot_table(index=None, columns='Item')
d_tot_emission

Item,Cereals excluding rice,"Eggs, hen, in shell","Meat, buffalo","Meat, cattle","Meat, chicken","Meat, goat","Meat, pig","Meat, sheep","Milk, whole fresh buffalo","Milk, whole fresh camel","Milk, whole fresh cow","Milk, whole fresh goat","Milk, whole fresh sheep","Rice, paddy"
1961,0.1618,1.1505,103.2366,37.588,1.0073,53.4359,3.4908,43.8651,1.6897,3.8394,1.6273,2.2959,5.1939,1.8606
1962,0.163,1.1476,101.1451,36.5186,1.0115,55.2056,3.4938,43.252,1.6719,3.8313,1.6105,2.3469,5.2,1.8387
1963,0.1719,1.1287,100.0774,35.1963,1.009,56.6251,3.3577,43.3207,1.6584,3.7916,1.6272,2.3051,5.0408,1.6947
1964,0.1759,1.1012,101.5181,35.4845,0.9839,55.3735,3.2808,44.027,1.6496,3.7753,1.5882,2.3267,5.0146,1.6617
1965,0.1871,1.1035,100.6839,35.8378,0.929,53.5321,3.1694,44.4964,1.6492,3.759,1.5194,2.381,4.8876,1.7231
1966,0.188,1.1077,105.4525,34.7524,0.8909,52.3708,3.0807,44.0231,1.6468,3.7764,1.5143,2.3609,4.844,1.7108
1967,0.1926,1.1032,106.5046,34.0016,0.8923,51.9045,3.1317,43.5618,1.6895,3.724,1.474,2.4182,5.0444,1.6304
1968,0.1971,1.0998,106.8264,32.978,0.8912,51.8701,3.1049,42.8697,1.6632,3.7097,1.4525,2.4835,5.2116,1.5991
1969,0.2045,1.0846,107.1353,32.2959,0.8541,50.696,3.1155,43.7275,1.6549,3.7154,1.4494,2.5134,5.1329,1.5837
1970,0.2162,1.0904,105.4367,32.4799,0.8133,49.895,3.0368,41.732,1.7136,3.4356,1.4237,2.5372,5.1478,1.5094
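
When there is exactly one row per item, as in the table above, a plain transpose gives the same years-by-items layout. This is an alternative sketch rather than what the notebook does, using two items and two years copied from the output above.

```python
import pandas as pd

# Two items and two years copied from the table above, in the one-row-per-item layout
wide = pd.DataFrame({
    "Item": ["Meat, cattle", "Meat, chicken"],
    "1961": [37.588, 1.0073],
    "1962": [36.5186, 1.0115],
})

# Items become the columns and years become the index
by_year = wide.set_index("Item").T
by_year.index.name = "years"
print(by_year)
```

Unlike `pivot_table`, the transpose does no aggregation, so it only fits when items are not repeated.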


In [174]:
p = figure(x_range=d_tot_emission.columns.tolist(), plot_height=500, title="Total of the GHG emissions intensity for agriculture for 1961 - 2016")
p.vbar(x=d_tot_emission.columns.tolist(), top=d_tot_emission.sum(), width=0.9)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1
p.yaxis.axis_label = 'GHG emissions (kg CO2eq/kg product)'
show(p)
##### should try to sort the values for easier conclusions
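
One way to do that sorting is to order the column totals before building the chart. The sketch below only prepares the data; it uses three columns and two years copied from the pivoted table above, so the totals are illustrative rather than the full 1961 - 2016 sums.

```python
import pandas as pd

# Three columns and two years copied from the pivoted table above
d = pd.DataFrame(
    {
        "Meat, cattle": [37.588, 36.5186],
        "Meat, chicken": [1.0073, 1.0115],
        "Meat, buffalo": [103.2366, 101.1451],
    },
    index=["1961", "1962"],
)

# Sum per product, largest first
totals = d.sum().sort_values(ascending=False)

# The sorted labels can serve as the categorical x_range, the values as bar heights
x_labels = totals.index.tolist()
bar_heights = totals.values

print(x_labels)
```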

In [179]:
#### From here I decided to switch the plot axes (easier?): stack by product instead of by year
# categories will be x axis
# years will be stackers
years = ['1961', '1971', '1981', '1991', '2001', '2011']
products = sorted(emissions_int_world['Item'].tolist())

temp_d = emissions_int_world[years].join(emissions_int_world['Item'])
temp_d = temp_d.sort_values(by=['Item']).reset_index(drop=True)
temp_d

Unnamed: 0,1961,1971,1981,1991,2001,2011,Item
0,0.1618,0.2053,0.2414,0.2413,0.2277,0.225,Cereals excluding rice
1,1.1505,1.0897,0.8777,0.7639,0.6644,0.6773,"Eggs, hen, in shell"
2,103.2366,103.2246,90.8772,74.2182,65.4022,57.1905,"Meat, buffalo"
3,37.588,33.3138,31.2939,28.437,27.6655,25.5779,"Meat, cattle"
4,1.0073,0.8079,0.7553,0.7722,0.6543,0.5983,"Meat, chicken"
5,53.4359,49.2669,46.3433,36.784,34.3226,30.3772,"Meat, goat"
6,3.4908,3.0872,2.7672,2.2297,1.7479,1.5846,"Meat, pig"
7,43.8651,41.2118,39.5268,34.3866,24.1582,22.1005,"Meat, sheep"
8,1.6897,1.7338,1.5547,1.4486,1.0854,0.9653,"Milk, whole fresh buffalo"
9,3.8394,3.425,3.2587,3.3682,3.0666,2.6224,"Milk, whole fresh camel"
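
A compact way to read this table is the percentage change in intensity between the first and last selected year. The sketch below copies three rows from the output above and is meant as an illustration, not as part of the original analysis.

```python
import pandas as pd

# Rows copied from the table above (kg CO2eq per kg of product)
temp = pd.DataFrame({
    "Item": ["Meat, buffalo", "Meat, cattle", "Meat, chicken"],
    "1961": [103.2366, 37.588, 1.0073],
    "2011": [57.1905, 25.5779, 0.5983],
}).set_index("Item")

# Percentage change in emissions intensity between 1961 and 2011
pct_change = (temp["2011"] - temp["1961"]) / temp["1961"] * 100
print(pct_change.round(1))
```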


In [180]:
from bokeh.palettes import brewer
from bokeh.models import ColumnDataSource, value

source = ColumnDataSource(data=temp_d)

p = figure(x_range = products, plot_width=800, title="Total of the GHG emissions intensity for agriculture")#, toolbar_location='above', tools=TOOLS)
colors = brewer['Spectral'][6]

p.vbar_stack(years, x='Item', width=0.9, source=source, color=colors,
                legend=[value(x) for x in years])

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1
p.xaxis.axis_label = 'Product'
p.yaxis.axis_label = 'GHG emissions (kg CO2eq/kg product)'
show(p)

In [181]:
# Now we focus on meat to display stacked emissions per meat type
d_tot_emission.reset_index(inplace=True) # turn the year index into columns
d_tot_emission = d_tot_emission.rename(columns={'index' : 'years'})# rename the column
#d_tot_emission.years = pd.to_numeric(d_tot_emission.years, downcast='integer') # turn years into int for filtering

In [187]:
allyears = d_tot_emission.years.tolist()
meat_products = ['Meat, buffalo',
                 'Meat, cattle',
                 'Meat, chicken',
                 'Meat, goat',
                 'Meat, pig',
                 'Meat, sheep']
meat_emissions = d_tot_emission[meat_products].join(d_tot_emission['years'])
meat_emissions.columns = ['Buffalo', 'Cattle', 'Chicken', 'Goat', 'Pig', 'Sheep', 'years']
meat_emissions

In [191]:
source1 = ColumnDataSource(data=meat_emissions)

p1 = figure(x_range = allyears, plot_width=800, title="GHG emission intensity per meat since 1961")#, toolbar_location='above', tools=TOOLS)
colors = brewer['Paired'][6]

# One line per meat type, each with its own colour from the palette
for color, meat in zip(colors, meat_emissions.columns[:-1]):
    p1.line(meat_emissions['years'], meat_emissions[meat], legend=meat + " meat",
            line_color=color, line_width=3)


#p1.xgrid.grid_line_color = None
#p1.y_range.start = 0
p1.xaxis.major_label_orientation = 1
p1.xaxis.axis_label = 'Year'
p1.yaxis.axis_label = 'GHG emissions (kg CO2eq/kg product)'
#p1.legend.orientation = "horizontal"
show(p1)
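
The question list includes a prediction for 2030 and 2050. A very crude first attempt, assuming the downward trend is roughly linear (which it may well not be), is to fit a straight line and extrapolate. The sketch uses the world 'Meat, cattle' values from the decade table above; the helper `predict` is a hypothetical name introduced only for this example.

```python
import numpy as np

# World emissions intensity for 'Meat, cattle', copied from the decade table above
years = np.array([1961, 1971, 1981, 1991, 2001, 2011])
cattle = np.array([37.588, 33.3138, 31.2939, 28.437, 27.6655, 25.5779])

# Fit a straight line to (years since 1961, intensity)
slope, intercept = np.polyfit(years - years[0], cattle, deg=1)

def predict(year):
    """Extrapolate the fitted line to a given calendar year (hypothetical helper)."""
    return slope * (year - years[0]) + intercept

pred_2030 = predict(2030)
print(f"slope: {slope:.3f} per year, 2030 estimate: {pred_2030:.1f}")
```

A straight line will eventually predict negative intensities, so any extrapolation this far out should be read as a rough indication only.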

In [186]:
source1 = ColumnDataSource(data=meat_emissions)

p1 = figure(x_range = allyears, plot_width=800, title="Stacked GHG emission intensity per meat since 1961")#, toolbar_location='above', tools=TOOLS)
colors = brewer['Paired'][6]

p1.vbar_stack(meat_emissions.columns.tolist()[:-1], x='years', width=0.9, source=source1, color=colors,
             legend=[value(x) for x in meat_emissions.columns.tolist()[:-1]])

#p1.xgrid.grid_line_color = None
#p1.y_range.start = 0
p1.xaxis.major_label_orientation = 1
p1.xaxis.axis_label = 'Year'
p1.yaxis.axis_label = 'GHG emissions (kg CO2eq/kg product)'
#p1.legend.orientation = "horizontal"
show(p1)