# Final Project

### Are we producing enough food?
**Describe the background of the problem**  
The world population is expected to grow from 7.3 billion today to 9.7 billion in the year 2050. Finding solutions for feeding the growing world population has become a hot topic for food and agriculture organizations, entrepreneurs and philanthropists. These solutions range from changing the way we grow our food to changing the way we eat. To make things harder, the world's climate is changing, and it both affects and is affected by agriculture, the way we grow our food. This dataset offers insight into worldwide food production, focusing on a comparison between food produced for human consumption and feed produced for animals.

**Describe where to get the data**  
The Food and Agriculture Organization of the United Nations (FAO) provides free access to food and agriculture data for over 245 countries and territories, from 1961 to the most recent update (which depends on the dataset). One dataset in the FAO's database is the Food Balance Sheets, which presents a comprehensive picture of the pattern of a country's food supply during a specified reference period. The last update loaded into the FAO database was in 2013. For each food item, the food balance sheet shows the sources of supply and its utilization. This portion of the dataset focuses on two utilizations of each food item:

* Food - refers to the total amount of the food item available as human food during the reference period.
* Feed - refers to the quantity of the food item available for feeding to the livestock and poultry during the reference period.

The dataset is gathered from: https://www.kaggle.com/dorbicycle/world-foodfeed-production

**Frame the Machine Learning problem: What are input features? What is the model expected to learn? Is it supervised learning / unsupervised learning? Is it a classification / regression problem?**  

*Input features:*   
* Area
* Item
* Element (Food or Feed)
* Unit
* Y1961 through Y2013 (one column per year)

The question is whether, if current patterns of food and feed production continue, there will be enough food to feed the growing population by 2050. This is a supervised learning problem, specifically regression.
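
One way to picture this framing is to reshape the year columns into a numeric `year` input and fit a trend per Element. The sketch below is only an illustration under stated assumptions: the three-year sample values are made up, and a straight-line fit with `numpy.polyfit` stands in for whatever model is eventually chosen.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample in the same wide layout as the FAO data
wide = pd.DataFrame({
    "Area Abbreviation": ["AAA", "AAA"],
    "Element": ["Food", "Feed"],
    "Y2011": [100.0, 40.0],
    "Y2012": [110.0, 42.0],
    "Y2013": [120.0, 44.0],
})

# Reshape to one row per (series, year) so that year becomes a numeric input
long_df = wide.melt(id_vars=["Area Abbreviation", "Element"],
                    var_name="year", value_name="production")
long_df["year"] = long_df["year"].str.lstrip("Y").astype(int)

# Fit a straight line per Element and extrapolate it to 2050
forecast_2050 = {}
for element, grp in long_df.groupby("Element"):
    slope, intercept = np.polyfit(grp["year"], grp["production"], deg=1)
    forecast_2050[element] = slope * 2050 + intercept

print(forecast_2050)
```

A linear trend is the simplest possible choice here; extrapolating 37 years beyond the data is a strong assumption with any model.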

**Describe briefly the research plan: what models to use? How to measure the performance?**  
I will use logistic regression to predict future food/feed production and measure the performance by comparing the minimum amount of food deemed necessary to feed a given population against the model's prediction for that population. Because logistic regression is a classifier and the targets here are continuous, a random forest regressor is also tried.
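
The comparison described above can be sketched as a small helper. Everything numeric here is a placeholder: the function name `supply_gap`, the predicted production figure and the per-person requirement are hypothetical and are not taken from the dataset. Only the 9.7 billion population figure comes from the background section.

```python
def supply_gap(predicted_food_kt, population, kg_per_person_per_year):
    """Return predicted supply minus required supply, in 1000 tonnes."""
    # 1000 tonnes = 1e6 kg, so divide kilograms by 1e6
    required_kt = population * kg_per_person_per_year / 1e6
    return predicted_food_kt - required_kt


# Placeholder inputs (assumptions for illustration only)
gap = supply_gap(predicted_food_kt=6_000_000,
                 population=9.7e9,
                 kg_per_person_per_year=500)

# A positive gap means predicted production exceeds the assumed requirement
print(gap)
```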

### Biggest producers of FOOD in the world:
China, India, the USA, and Brazil have been the leading producers of food in the world since 1961.

(graphs are based off this database before editing and from https://www.kaggle.com/farazrahman/are-we-producing-enough-food-simple-time-series/notebook)
![title](Data/biggest_producers_of_food.png)

### Biggest producers of FEED in the world:
China, the USA, and Brazil have been the leading producers of feed in the world since 1961.

(graphs are based off this database before editing and from https://www.kaggle.com/farazrahman/are-we-producing-enough-food-simple-time-series/notebook)
![title](Data/biggest_producers_of_feed.png)

### Top 10 food items:
These food items have consistently been the most produced worldwide. However, production of fruits and vegetables has increased while production of starchy items has decreased. Milk, cereals, and vegetables remain the top three food items produced.

(graphs are based off this database before editing and from https://www.kaggle.com/farazrahman/are-we-producing-enough-food-simple-time-series/notebook)
![title](Data/top_10_food_items.png)

### Top 10 feed items:
Cereals, maize, and starchy roots are the top three feed items produced.

(graphs are based off this database before editing and from https://www.kaggle.com/farazrahman/are-we-producing-enough-food-simple-time-series/notebook)
![title](Data/top_10_feed_items.png)

### Food versus feed production:
There is a steady increase in the production of food, which appears to track the increase in population over the years. The production of feed, however, increases at a much slower rate, leaving a noticeable gap between food and feed production.

![title](Data/food_vs_feed_production.png)
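
The yearly totals behind a food-versus-feed comparison like this one can be computed with a group-by on the Element column. This is a minimal sketch on a made-up four-row sample; the column names follow the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: two items, each with a Food row and a Feed row
sample = pd.DataFrame({
    "Element": ["Food", "Feed", "Food", "Feed"],
    "Y2012": [100, 40, 50, 10],
    "Y2013": [110, 42, 55, 11],
})

# Sum every year column within each Element
totals = sample.groupby("Element")[["Y2012", "Y2013"]].sum()

# Gap between food and feed production for each year
gap_by_year = totals.loc["Food"] - totals.loc["Feed"]

print(totals)
print(gap_by_year)
```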

In [98]:
# Import the CSV files as DataFrames

import os
import pandas as pd

# The CSV files are expected in the current working directory
datapath = os.getcwd()

# Cleaned-up dataset: missing food and feed amounts for certain years
# were replaced with the median of the years that had data
fao = pd.read_csv(os.path.join(datapath, 'FAO.csv'), encoding="latin1")
# Same data with an added empty column for the year we want to predict
fao_test = pd.read_csv(os.path.join(datapath, 'FAO_test.csv'), encoding="latin1")


# Encode Element as 0 = Food (5142) and 1 = Feed (5521),
# and convert the unit to an integer.
# Restrict each replacement to its own column so other columns are untouched.
for df in (fao, fao_test):
    df['Element'] = df['Element'].replace({'Food': 0, 'Feed': 1})
    df['Unit'] = df['Unit'].replace({'1000 tonnes': 1000})

# cols = ['Y1961',
#        'Y1962', 'Y1963', 'Y1964', 'Y1965', 'Y1966', 'Y1967', 'Y1968', 'Y1969',
#        'Y1970', 'Y1971', 'Y1972', 'Y1973', 'Y1974', 'Y1975', 'Y1976', 'Y1977',
#        'Y1978', 'Y1979', 'Y1980', 'Y1981', 'Y1982', 'Y1983', 'Y1984', 'Y1985',
#        'Y1986', 'Y1987', 'Y1988', 'Y1989', 'Y1990', 'Y1991', 'Y1992', 'Y1993',
#        'Y1994', 'Y1995', 'Y1996', 'Y1997', 'Y1998', 'Y1999', 'Y2000', 'Y2001',
#        'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007', 'Y2008', 'Y2009',
#        'Y2010', 'Y2011', 'Y2012', 'Y2013']

# for col in cols:
#    fao[col] = fao[col].apply(lambda x: int(x) if x == x else 0)

# Keep the identifier columns plus one column per year (Y1961 to Y2013)
id_cols = ['Area Abbreviation', 'Item Code', 'Element', 'Unit']
year_cols = [f'Y{year}' for year in range(1961, 2014)]

fao = fao[id_cols + year_cols]
fao_test = fao_test[id_cols + year_cols + ['Y2050']]

print(fao.head(20))
print(fao_test.head(20))

   Area Abbreviation  Item Code  Element  Unit  Y1961  Y1962  Y1963  Y1964  \
0                AFG       2511        0  1000   1928   1904   1666   1950   
1                AFG       2805        0  1000    183    183    182    220   
2                AFG       2513        1  1000     76     76     76     76   
3                AFG       2513        0  1000    237    237    237    238   
4                AFG       2514        1  1000    210    210    214    216   
5                AFG       2514        0  1000    403    403    410    415   
6                AFG       2517        0  1000     17     18     19     20   
7                AFG       2520        0  1000      0      0      0      0   
8                AFG       2531        0  1000    111     97    103    110   
9                AFG       2536        1  1000     45     45     45     45   
10               AFG       2537        1  1000      0      0      0      0   
11               AFG       2542        0  1000     45     41    

In [91]:
# Drop the string column (and the empty Y2050 column from the test set)
fao_noStrings = fao.drop(columns=['Area Abbreviation'])

fao_test_noStrings = fao_test.drop(columns=['Area Abbreviation', 'Y2050'])

# Feature scaling

from sklearn.preprocessing import StandardScaler
sclr = StandardScaler()

# Fit the scaler on the training data only
fao_train = sclr.fit_transform(fao_noStrings)
print(fao_train)
fao_train = pd.DataFrame(fao_train, columns=fao_noStrings.columns)
print(fao_train.shape)


# Reuse the training statistics on the test data instead of refitting
fao_test = sclr.transform(fao_test_noStrings)
print(fao_test)
fao_test = pd.DataFrame(fao_test, columns=fao_test_noStrings.columns)
print(fao_test.shape)

[[-1.22985571 -0.47465426  0.         ...,  0.70850251  0.7026396
   0.69464128]
 [ 0.74369684 -0.47465426  0.         ..., -0.01307303 -0.02241625
  -0.02469471]
 [-1.21643018  2.10679663  0.         ..., -0.05958781 -0.0320065
  -0.03466538]
 ..., 
 [ 1.78417522 -0.47465426  0.         ..., -0.08736038 -0.08607566
  -0.08612693]
 [ 1.79088798 -0.47465426  0.         ..., -0.09417573 -0.09268962
  -0.09255963]
 [ 1.56936678 -0.47465426  0.         ..., -0.09417573 -0.09268962
  -0.09255963]]
(21477, 56)
[[-1.22985571 -0.47465426  0.         ...,  0.70850251  0.7026396
   0.69464128]
 [ 0.74369684 -0.47465426  0.         ..., -0.01307303 -0.02241625
  -0.02469471]
 [-1.21643018  2.10679663  0.         ..., -0.05958781 -0.0320065
  -0.03466538]
 ..., 
 [ 1.78417522 -0.47465426  0.         ..., -0.08736038 -0.08607566
  -0.08612693]
 [ 1.79088798 -0.47465426  0.         ..., -0.09417573 -0.09268962
  -0.09255963]
 [ 1.56936678 -0.47465426  0.         ..., -0.09417573 -0.09268962
  -0.092
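
Scaling statistics are normally learned from the training data only and then reused on any other data, so that both sets are expressed on the same scale. The sketch below shows that pattern on tiny made-up arrays; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

Calling `fit_transform` on the test data instead would recompute the mean and standard deviation from the test set, so the same raw value could map to different scaled values in the two sets.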

In [107]:
from sklearn.model_selection import train_test_split

# Inputs: item, element and unit; targets: one column per year
year_cols = [col for col in fao_train.columns if col.startswith('Y')]
X = fao_train[['Item Code', 'Element', 'Unit']]
y = fao_train[year_cols]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)

(17181, 53)
(4296, 53)
(17181, 1)
(4296, 1)


In [108]:
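from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(random_state=42)

# LogisticRegression is a classifier: it expects discrete class labels,
# so fitting it to continuous production values raises the ValueError below
log_reg.fit(X_train, y_train)
# df2_test was never defined in this notebook; predict on the validation split
test_set = X_val.assign(prediction=log_reg.predict(X_val))
print(log_reg)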

  y = column_or_1d(y, warn=True)


ValueError: Unknown label type: 'continuous'
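
This error arises because `LogisticRegression` is a classifier and expects discrete class labels, while the production values are continuous. A regressor accepts such targets. The sketch below uses `LinearRegression` on synthetic data as one possible substitute; the data, coefficients and variable names are assumptions for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 3 input features and a continuous, noiseless target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0

# A regressor fits continuous targets without complaint
lin_reg = LinearRegression().fit(X_demo, y_demo)

print(lin_reg.coef_, lin_reg.intercept_)
```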

In [94]:
# Model fine-tuning

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    # try 12 (3×4) combinations of hyperparameters;
    # note that max_features cannot exceed the number of columns in X_train
    {'n_estimators': [3, 10, 30],
     'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap
    # set as False
    {'bootstrap': [False],
     'n_estimators': [3, 10],
     'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of
# (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg,
                           param_grid,
                           cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(X_train, y_train)

ValueError: max_features must be in (0, n_features]
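
The error above indicates that some `max_features` values in the grid exceed the number of input columns. One way to avoid this is to derive the candidate values from the data. The following is a sketch on synthetic data with three features; the data, the smaller grid and `cv=3` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with 3 input features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=60)

# Keep max_features within 1..n_features
n_features = X_demo.shape[1]
param_grid = {
    "n_estimators": [3, 10],
    "max_features": list(range(1, n_features + 1)),
}

search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid,
                      cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
```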

In [95]:
# The best hyperparameter combination found:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

print('best parameters:', grid_search.best_params_)

# The best model with the above parameters
best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train)
# mean_squared_error expects (y_true, y_pred)
print('MSE:', mean_squared_error(y_train,
                                 y_train_pred))
scores = cross_val_score(best_model,
                          X_train,
                          y_train,
                          cv=10,
                          scoring="neg_mean_squared_error")
print(scores)

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

In [96]:
# Analyze the best model
feature_importance = best_model.feature_importances_
attributes = X_train.columns
sorted(zip(feature_importance, attributes),
       reverse=True)

NameError: name 'best_model' is not defined

In [97]:
# Evaluate the model on the training set

from sklearn.metrics import mean_squared_error

train_predictions = log_reg.predict(X_train)
log_mse = mean_squared_error(y_train, train_predictions)
print('MSE on training set', log_mse)

NotFittedError: This LogisticRegression instance is not fitted yet
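
A model has to be fitted successfully before it can be evaluated, and an error measured on held-out rows is usually more informative than one measured on the training rows. The sketch below shows that flow on synthetic data with a random forest regressor; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Fit first, then predict on rows the model has not seen
model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_tr, y_tr)

test_mse = mean_squared_error(y_te, model.predict(X_te))
test_rmse = np.sqrt(test_mse)

print('MSE on held-out set', test_mse)
print('RMSE on held-out set', test_rmse)
```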