# DASC5301 Data Science, Spring 2022, Chengkai Li, University of Texas at Arlington
# Programming Assignment 3
# Solution 
# Due: Wednesday, May 11th, 2022, 11:59pm


## **Academic Honesty**
This assignment must be done individually and independently. You must implement the whole assignment by yourself. Academic dishonesty is not tolerated.

## **Requirements**

1. When you work on this assignment, you should make a copy of this notebook in Google Colab. This can be done using the option `File > Save a copy in Drive` in Google Colab. 

2. You should fill in your answer for each task inside the code block right under the task. 

3. You should only insert your code into the designated code blocks, as mentioned above. Other than that, you shouldn't change anything else in the notebook, unless otherwise instructed.

4.  For each code block, you are free to use multiple lines of code. 

5.   Even if you can only partially solve a task, you should include your code in the code block, which allows us to consider partial credit. 

6.   However, your code should not raise errors. Any code raising errors will not get partial credit. 

7.   We will test your code in Google Colab. Make sure your code runs in Google Colab.

8. Note that, although the code blocks are empty, you can see some outputs. These are outputs from our previous execution of the code, for your reference. If you run the Colab again without filling in the correct code, you will not see these outputs. You can always refer to the original assignment Colab to see these outputs. 

9. To submit your assignment, download your Colab into a .ipynb file. This can be done using the option `File > Download > Download .ipynb` in Google Colab.

10. Submit the downloaded .ipynb file into the Programming Assignment 3 entry in Canvas.

## **Part A: Web Scraping**

In this part of the assignment, we use BeautifulSoup to scrape population and environmental data from a website called Worldometer. 

## **Task 1: Extract table "World Population by Country" (25 points)**

From the page https://www.worldometers.info/world-population/, extract the table "World Population by Country" (as shown in the screenshot below) into a Pandas `DataFrame` named `country_pop_df`. Note that there are 235 countries to extract. 

![world_pop_country.jpg](https://drive.google.com/uc?id=1Kaf7_Lid751OruJ4DgkPsFCjyxoI4bXx)

In [None]:
# code for task 1

import requests
from bs4 import BeautifulSoup
import pandas as pd

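def get_table(link, text):
  # Download the page and parse its HTML.
  page = requests.get(link)
  soup = BeautifulSoup(page.content, 'html.parser')
  # Locate the <h2> heading with the given title, then take the first table after it.
  # (`string=` is the current name of the argument formerly called `text=`.)
  text_loc = soup.find("h2", string=text)
  table = text_loc.find_next('table')
  return table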

def row_finder(table):
  rows = []
  for row in table.find_all('tr')[1:]:
    row = [r.text for r in row.find_all('td')]
    rows.append(row)
  return rows

def col_finder(table):
  columns = []
  for col in table.find_all('th'):
    columns.append(col.text)
  return columns

weblink = 'https://www.worldometers.info/world-population/'
country_pop = get_table(link=weblink, text='World Population by Country')
columns = col_finder(country_pop)
rows = row_finder(country_pop)
country_pop_df = pd.DataFrame(rows, columns = columns)
country_pop_df.drop('#', axis=1,inplace=True)

If your code is correct, the result of `country_pop_df.head()` should match the output below.

In [None]:
country_pop_df.head()

Unnamed: 0,Country (or dependency),Population(2020),YearlyChange,NetChange,Density (P/Km²),Land Area (Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare
0,China,1439323776,0.39 %,5540090,153,9388211,-348399,1.69,38,60.8 %,18.5 %
1,India,1380004385,0.99 %,13586631,464,2973190,-532687,2.2402,28,35 %,17.7 %
2,United States,331002651,0.59 %,1937734,36,9147420,954806,1.7764,38,82.8 %,4.2 %
3,Indonesia,273523615,1.07 %,2898047,151,1811570,-98955,2.3195,30,56.4 %,3.5 %
4,Pakistan,220892340,2 %,4327022,287,770880,-233379,3.55,23,35.1 %,2.8 %


## **Task 2: For every country, extract the most populous city and its population (15 points)**

The table "World Population by Country" from Task 1 includes a link to each country's page, which contains a table "Main Cities by Population ..." (e.g., see the screenshot below for the USA table). From there, find the most populous city of the country and its population. Do this for all 235 countries from the table in Task 1. Save the data as a Pandas `DataFrame` named `pop_cities_df`. This `DataFrame` should have three columns: `Country`, `Most Populous City`, and `Most Populous City Population`. 

![main_cities.jpg](https://drive.google.com/uc?id=1El7rFWiohWlNplakVD6E_U6f7ZfKtgG-)

In [None]:
# code for task 2
homepage = 'https://www.worldometers.info/'

country_links = [r.find('a')['href'] for r in country_pop.find_all('tr')[1:]]
country_names = []
rows = []
for i, country in enumerate(country_pop_df['Country (or dependency)']):
  page = requests.get(homepage+country_links[i])
  soup = BeautifulSoup(page.content, 'html.parser')
  table_loc = soup.find_all("table")
  if len(table_loc)==4:
    table = table_loc[-1]
    rows.append(row_finder(table)[0][1:])
    country_names.append(country)
pop_cities_df = pd.DataFrame(rows,columns=['Most Populous City', 'Most Populous City Population'])
pop_cities_df.insert(0, 'Country', country_names)

If your code is correct, the result of `pop_cities_df.head()` should match the output below.

In [None]:
pop_cities_df.head()

Unnamed: 0,Country,Most Populous City,Most Populous City Population
0,China,Shanghai,22315474
1,India,Mumbai,12691836
2,United States,New York City,8175133
3,Indonesia,Jakarta,8540121
4,Pakistan,Karachi,11624219


## **Task 3: Merge the `DataFrames` from Tasks 1 and 2 into a single `DataFrame` (5 points)**
Merge `country_pop_df` and `pop_cities_df` into a single `DataFrame` `final_df`.

In [None]:
# code for task 3

final_df = pd.merge(country_pop_df, pop_cities_df, how='left', left_on='Country (or dependency)', right_on='Country') 
final_df.drop('Country', axis=1, inplace=True) 

If your code is correct, the results of `final_df.head()` and `final_df.info()` should match the outputs below.

In [None]:
final_df.head()

Unnamed: 0,Country (or dependency),Population(2020),YearlyChange,NetChange,Density (P/Km²),Land Area (Km²),Migrants(net),Fert.Rate,Med.Age,UrbanPop %,WorldShare,Country,Most Populous City,Most Populous City Population
0,China,1439323776,0.39 %,5540090,153,9388211,-348399,1.69,38,60.8 %,18.5 %,China,Shanghai,22315474
1,India,1380004385,0.99 %,13586631,464,2973190,-532687,2.2402,28,35 %,17.7 %,India,Mumbai,12691836
2,United States,331002651,0.59 %,1937734,36,9147420,954806,1.7764,38,82.8 %,4.2 %,United States,New York City,8175133
3,Indonesia,273523615,1.07 %,2898047,151,1811570,-98955,2.3195,30,56.4 %,3.5 %,Indonesia,Jakarta,8540121
4,Pakistan,220892340,2 %,4327022,287,770880,-233379,3.55,23,35.1 %,2.8 %,Pakistan,Karachi,11624219


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199 entries, 0 to 198
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Country (or dependency)        199 non-null    object
 1   Population(2020)               199 non-null    object
 2   YearlyChange                   199 non-null    object
 3   NetChange                      199 non-null    object
 4   Density (P/Km²)                199 non-null    object
 5   Land Area (Km²)                199 non-null    object
 6   Migrants(net)                  199 non-null    object
 7   Fert.Rate                      199 non-null    object
 8   Med.Age                        199 non-null    object
 9   UrbanPop %                     199 non-null    object
 10  WorldShare                     199 non-null    object
 11  Country                        199 non-null    object
 12  Most Populous City             198 non-null    object
 13  Most 

## **Part B: Deep Learning**

In this part of the assignment, we build deep learning models on a dataset about data science professionals. More specifically, the models will estimate whether a person earns an annual salary of more than $100,000 or not, using information about their skills, employment, seniority, and so on. 

To start, run the code cell below to download the data and load it into a pandas `DataFrame`.

In [None]:
import pandas as pd

!wget -O ds_semi_processed.csv "https://drive.google.com/uc?id=1d1ub9_6iaWZ7yjgNLDRaUmQ8hWKrC39H"
df = pd.read_csv('ds_semi_processed.csv')

--2022-05-04 02:32:03--  https://drive.google.com/uc?id=1d1ub9_6iaWZ7yjgNLDRaUmQ8hWKrC39H
Resolving drive.google.com (drive.google.com)... 64.233.189.102, 64.233.189.138, 64.233.189.101, ...
Connecting to drive.google.com (drive.google.com)|64.233.189.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-10-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/oasmaps4l88d3g6t7fuq7pk9qqs9fp5c/1651631475000/16694202268609887225/*/1d1ub9_6iaWZ7yjgNLDRaUmQ8hWKrC39H [following]
--2022-05-04 02:32:04--  https://doc-10-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/oasmaps4l88d3g6t7fuq7pk9qqs9fp5c/1651631475000/16694202268609887225/*/1d1ub9_6iaWZ7yjgNLDRaUmQ8hWKrC39H
Resolving doc-10-80-docs.googleusercontent.com (doc-10-80-docs.googleusercontent.com)... 142.250.157.132, 2404:6800:4008:c13::84
Connecting to doc-10-80-docs.googleusercontent.com (doc-10-80-docs.googleusercontent.com)|142.250.157.1

In [None]:
df.head()

Unnamed: 0,Age,Gender,Degree,Title,Industry of employer,Size of employer,State of employer in incorporate machine learning into business,Yearly salary > $100k,Years of coding experience,Years of experience in machine learning methods,...,Regularly use Scikit-learn,Regularly use TensorFlow,Regularly use Keras,Regularly use PyTorch,Regularly use Xgboost,Regularly use Linear or Logistic Regression,Regularly use Decision Trees or Random Forests,Regularly use Gradient Boosting Machines,Regularly use Bayesian Approaches,Regularly use Convolutional Neural Networks
0,55-59,Man,4,Software Engineer,Computers/Technology,1,recently started,No,5,5.0,...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,30-34,Woman,4,Data Scientist,Other,5,recently started,No,4,6.0,...,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0
2,40-44,Man,3,Research Scientist,Manufacturing/Fabrication,4,I do not know,No,1,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,50-54,Man,4,Data Engineer,Computers/Technology,5,well established,Yes,6,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,35-39,Man,6,Research Scientist,Medical/Pharmaceutical,5,for insights only,Yes,5,7.0,...,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 27 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   Age                                                              1845 non-null   object 
 1   Gender                                                           1845 non-null   object 
 2   Degree                                                           1845 non-null   int64  
 3   Title                                                            1845 non-null   object 
 4   Industry of employer                                             1845 non-null   object 
 5   Size of employer                                                 1845 non-null   int64  
 6   State of employer in incorporate machine learning into business  1845 non-null   object 
 7   Yearly salary > $100k                     

## **B.1 Data Preprocessing**

The dataset has several categorical attributes. In deep learning, there are limited ways of building models that directly work with categorical attributes. We need to preprocess such attributes before we can build and evaluate models. More specifically, we need to encode such attributes as numeric values.

### **(1) Representing the label column using `LabelEncoder()`**

## **Task 4: Convert `Yearly salary > $100k` into an integer attribute. (5 points)**

`Yearly salary > $100k` will be the class label attribute for our classification task. It currently has the values `Yes` and `No`. We need to convert the values into 1 and 0. (Note that it is not important which value becomes 1.) We will do it using *label encoding* (`preprocessing.LabelEncoder()` from the `sklearn` library), which assigns one distinct integer to each distinct attribute value. We used `LabelEncoder()` for another task in Programming Assignment 1. You can refer to it.
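For intuition, the idea behind label encoding can be sketched in plain Python: sort the distinct values, then number them in that order. This is only an illustration of the concept, not the `sklearn` implementation, and the sample values below are made up.

```python
# Hypothetical sample of the label column, for illustration only.
values = ['No', 'Yes', 'No', 'Yes', 'Yes']

# Sort the distinct values and assign one integer to each.
classes = sorted(set(values))
to_int = {label: i for i, label in enumerate(classes)}

encoded = [to_int[v] for v in values]
print(to_int)    # {'No': 0, 'Yes': 1}
print(encoded)   # [0, 1, 0, 1, 1]
```

With this ordering, `No` becomes 0 and `Yes` becomes 1, which agrees with the encoded values visible in later outputs of this notebook.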

In [None]:
# code for task 4

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

label_encoder = LabelEncoder()
df['Yearly salary > $100k'] = label_encoder.fit_transform(df['Yearly salary > $100k'])

### **(2) Representing ordinal attributes using `OrdinalEncoder()`**

## **Task 5: Convert `Age` into an integer attribute. (5 points)**

The original values in column `Age` are `25-29`, `30-34`, `35-39`, and so on. This is an ordinal attribute, but a model cannot understand the meaning of these values and the order relationship between the values. Therefore, we'd like to transform the values in this column into integers 0,1,2,3, and so on. In this way we preserve the ordinality of the values and the ordinal relationship can be recognized by a deep learning model.

To achieve that, we will use *ordinal encoding* (`preprocessing.OrdinalEncoder()` from the `sklearn` library). `OrdinalEncoder()` assigns integer values based on the lexicographical order of values in `Age`. Since the lexicographical order is consistent with the ordinal relationship we intend to keep (e.g., `25-29` is less than `30-34`), `OrdinalEncoder()` will just work. (If the lexicographical order were inconsistent with the ordinal relationship, there is a way to explicitly specify the order among values that `OrdinalEncoder()` should recognize. But that is not an issue here.)
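As a quick sanity check of the lexicographical-order argument, the plain-Python sketch below sorts a few age-range strings the way string comparison does. The ranges are hypothetical examples chosen for illustration, not the full list of values in the dataset. The second part shows a made-up case where lexicographical order would disagree with the intended order; that is the situation in which one would pass an explicit order through the `categories` parameter of `OrdinalEncoder()`.

```python
# Hypothetical age ranges for illustration; the dataset's actual values may differ.
ages = ['30-34', '25-29', '55-59', '35-39']

# String sorting is lexicographical, which here agrees with the numeric order.
ordered = sorted(ages)
codes = {age: i for i, age in enumerate(ordered)}
print(codes)

# A case where lexicographical order disagrees with the intended order:
# '10-14' sorts before '5-9' because the character '1' comes before '5'.
tricky = sorted(['5-9', '10-14'])
print(tricky)
```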

In [None]:
# code for task 5

from sklearn.preprocessing import OrdinalEncoder

ordinalencoder = OrdinalEncoder()
df['Age'] = ordinalencoder.fit_transform(df[['Age']])

### **(3) Representing nominal attributes using `OneHotEncoder()`**

The column `Most frequently used big data products` describes the big data product that a person uses most frequently. It has values such as `MySQL`, `PostgreSQL`, and so on. Based on what we learned earlier in the semester, this is a nominal attribute in that there isn't a meaningful order among the attribute values. We will use `one-hot encoding` (`preprocessing.OneHotEncoder()` from the `sklearn` library) to represent this attribute. More specifically, we will make one new binary-value column for each distinct big data product. A row has value `1` or `0` in that new column, based on its value in the original `Most frequently used big data products` column. We learned about the concept of one hot encoding in the [deep learning colab](https://colab.research.google.com/drive/1anftTmeq5cJ5Dzn0dYX0YBIIr0rPDUPw?usp=sharing) and the corresponding [lecture](https://echo360.org/media/40304de2-f433-4484-bb34-aab526882729/public).  

Apply the following code to perform one-hot encoding on `Most frequently used big data products`. After that, the results of `df.head(15)` show the new columns, each with the prefix `bigd_`. Note that we also dropped the original column `Most frequently used big data products`.



In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe_bigd = OneHotEncoder()
bigd_arr = ohe_bigd.fit_transform(df['Most frequently used big data products'].to_numpy().reshape(-1, 1))
bigd = pd.DataFrame(bigd_arr.todense().astype(int), columns=ohe_bigd.get_feature_names_out(['bigd']))
df = pd.concat([df, bigd], axis=1)
df = df.drop(['Most frequently used big data products'], axis=1)

df.head(15)

Unnamed: 0,Age,Gender,Degree,Title,Industry of employer,Size of employer,State of employer in incorporate machine learning into business,Yearly salary > $100k,Years of coding experience,Years of experience in machine learning methods,...,bigd_Microsoft Azure SQL Database,bigd_Microsoft SQL Server,bigd_MongoDB,bigd_MySQL,bigd_Oracle Database,bigd_Other,bigd_PostgreSQL,bigd_SQLite,bigd_Snowflake,bigd_nan
0,8.0,Man,4,Software Engineer,Computers/Technology,1,recently started,0,5,5.0,...,0,0,0,1,0,0,0,0,0,0
1,3.0,Woman,4,Data Scientist,Other,5,recently started,0,4,6.0,...,0,0,0,0,0,0,0,0,0,1
2,5.0,Man,3,Research Scientist,Manufacturing/Fabrication,4,I do not know,0,1,1.0,...,0,0,0,0,0,0,0,0,0,1
3,7.0,Man,4,Data Engineer,Computers/Technology,5,well established,1,6,1.0,...,0,0,0,1,0,0,0,0,0,0
4,4.0,Man,6,Research Scientist,Medical/Pharmaceutical,5,for insights only,1,5,7.0,...,0,0,0,0,0,0,1,0,0,0
5,2.0,Man,4,Data Scientist,Online Service/Internet-based Services,4,recently started,0,2,3.0,...,0,0,0,0,0,0,0,0,0,1
6,8.0,Man,3,Software Engineer,Other,2,for insights only,0,6,7.0,...,0,1,0,0,0,0,0,0,0,0
7,3.0,Man,4,Business Analyst,Accounting/Finance,5,recently started,0,2,2.0,...,0,0,0,0,0,0,0,0,0,1
8,8.0,Man,3,Other,Medical/Pharmaceutical,4,exploring,0,2,1.0,...,0,0,0,0,0,0,0,0,0,1
9,5.0,Woman,4,Other,Academics/Education,2,recently started,0,4,6.0,...,0,0,0,0,0,0,0,0,0,1


Notice that we get a column named `bigd_nan`. This is because `OneHotEncoder` treats null values as one more distinct value and gives them their own column. The column `bigd_nan` contains 1 for rows that had `nan` in the original column and 0 for every other row.
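The effect can be illustrated with a small plain-Python sketch. It is a conceptual illustration only, not how `OneHotEncoder` is implemented; the sample values are made up, and `None` stands in for a missing entry.

```python
# Hypothetical column values; None plays the role of a missing (nan) entry.
products = ['MySQL', None, 'PostgreSQL', 'MySQL']

# One category per distinct value, with the missing marker treated as its own category.
categories = sorted({str(p) for p in products})

# Each row becomes a 0/1 vector with exactly one 1.
one_hot = [[int(str(p) == c) for c in categories] for p in products]

print(categories)
for p, row in zip(products, one_hot):
    print(p, row)
```

The row with the missing value gets its single 1 in the column reserved for the missing marker, just as rows with `nan` get a 1 in `bigd_nan`.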

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 47 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   Age                                                              1845 non-null   float64
 1   Gender                                                           1845 non-null   object 
 2   Degree                                                           1845 non-null   int64  
 3   Title                                                            1845 non-null   object 
 4   Industry of employer                                             1845 non-null   object 
 5   Size of employer                                                 1845 non-null   int64  
 6   State of employer in incorporate machine learning into business  1845 non-null   object 
 7   Yearly salary > $100k                     

The dataset is semi-processed. Some of the columns, e.g., `Size of employer` and `Degree`, are already converted into integer attributes. Furthermore, columns starting with `Regularly use ...` are the outcome of applying one-hot encoding on an original attribute about data science tools regularly used by people. 

The remaining attributes still need to be transformed using one-hot encoding, in the same way in which `Most frequently used big data products` was transformed above. 

## **Task 6: Apply one-hot encoding on nominal attributes `Title`, `Gender`, `Industry of employer`, `State of employer in incorporate machine learning into business`, `Most frequently used data science platform`, and `Primary tool for analyzing data`. (10 points)**

In [None]:
# code for task 6

from sklearn.preprocessing import OneHotEncoder

ohe_title = OneHotEncoder()
title_arr = ohe_title.fit_transform(df['Title'].to_numpy().reshape(-1, 1))
title = pd.DataFrame(title_arr.todense().astype(int), columns=ohe_title.get_feature_names_out(['title']))
df = pd.concat([df, title], axis=1)
df = df.drop(['Title'], axis=1)

ohe_gender = OneHotEncoder()
gender_arr = ohe_gender.fit_transform(df['Gender'].to_numpy().reshape(-1, 1))
gender = pd.DataFrame(gender_arr.todense().astype(int), columns=ohe_gender.get_feature_names_out(['gender']))
df = pd.concat([df, gender], axis=1)
df = df.drop(['Gender'], axis=1)

ohe_industry = OneHotEncoder()
industry_arr = ohe_industry.fit_transform(df['Industry of employer'].to_numpy().reshape(-1, 1))
industry = pd.DataFrame(industry_arr.todense().astype(int), columns=ohe_industry.get_feature_names_out(['industry']))
df = pd.concat([df, industry], axis=1)
df = df.drop(['Industry of employer'], axis=1)

ohe_state = OneHotEncoder()
state_arr = ohe_state.fit_transform(df['State of employer in incorporate machine learning into business'].to_numpy().reshape(-1, 1))
state = pd.DataFrame(state_arr.todense().astype(int), columns=ohe_state.get_feature_names_out(['state']))
df = pd.concat([df, state], axis=1)
df = df.drop(['State of employer in incorporate machine learning into business'], axis=1)

ohe_platform = OneHotEncoder()
platform_arr = ohe_platform.fit_transform(df['Most frequently used data science platform'].to_numpy().reshape(-1, 1))
platform = pd.DataFrame(platform_arr.todense().astype(int), columns=ohe_platform.get_feature_names_out(['platform']))
df = pd.concat([df, platform], axis=1)
df = df.drop(['Most frequently used data science platform'], axis=1)

ohe_tool = OneHotEncoder()
tool_arr = ohe_tool.fit_transform(df['Primary tool for analyzing data'].to_numpy().reshape(-1, 1))
tool = pd.DataFrame(tool_arr.todense().astype(int), columns=ohe_tool.get_feature_names_out(['tool']))
df = pd.concat([df, tool], axis=1)
df = df.drop(['Primary tool for analyzing data'], axis=1)

In [None]:
df.head()

Unnamed: 0,Age,Degree,Size of employer,Yearly salary > $100k,Years of coding experience,Years of experience in machine learning methods,Experience with TPU,Regularly use Python,Regularly use R,Regularly use SQL,...,platform_desktop,platform_laptop,platform_nan,tool_Advanced statistical software,tool_Basic statistical software,tool_Business intelligence software,tool_Cloud-based data software & APIs,tool_Local development environments,tool_Other,tool_nan
0,8.0,4,1,0,5,5.0,1.0,1.0,0.0,1.0,...,0,0,0,0,0,0,1,0,0,0
1,3.0,4,5,0,4,6.0,1.0,1.0,0.0,1.0,...,0,0,0,0,0,0,1,0,0,0
2,5.0,3,4,0,1,1.0,1.0,0.0,0.0,0.0,...,0,1,0,1,0,0,0,0,0,0
3,7.0,4,5,1,6,1.0,4.0,1.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
4,4.0,6,5,1,5,7.0,3.0,1.0,1.0,1.0,...,1,0,0,0,0,0,0,1,0,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 96 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   Age                                              1845 non-null   float64
 1   Degree                                           1845 non-null   int64  
 2   Size of employer                                 1845 non-null   int64  
 3   Yearly salary > $100k                            1845 non-null   int64  
 4   Years of coding experience                       1845 non-null   int64  
 5   Years of experience in machine learning methods  1845 non-null   float64
 6   Experience with TPU                              1845 non-null   float64
 7   Regularly use Python                             1845 non-null   float64
 8   Regularly use R                                  1845 non-null   float64
 9   Regularly use SQL             

## **B.2 Load Preprocessed Data**

If you couldn't get the previous tasks done, don't panic. We provide a preprocessed file **ds_processed.csv** to you. You just need to run the following code to load it. In fact, you should use this preprocessed data file regardless, even if you successfully finish your previous tasks. This way we make sure everyone uses the same data file for creating the deep learning models, which allows us to fairly grade all submissions.

In [None]:
import pandas as pd
!wget -O ds_processed.csv "https://drive.google.com/uc?id=182hMFczb7vvOl5VzaXrZvPjWsCpiGUlr"
df = pd.read_csv('ds_processed.csv')

--2022-05-09 02:03:03--  https://drive.google.com/uc?id=182hMFczb7vvOl5VzaXrZvPjWsCpiGUlr
Resolving drive.google.com (drive.google.com)... 209.85.145.101, 209.85.145.100, 209.85.145.113, ...
Connecting to drive.google.com (drive.google.com)|209.85.145.101|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/gk9of8tfmt1gaa74a2lf3f0aim4k5a5t/1652061750000/16694202268609887225/*/182hMFczb7vvOl5VzaXrZvPjWsCpiGUlr [following]
--2022-05-09 02:03:03--  https://doc-14-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/gk9of8tfmt1gaa74a2lf3f0aim4k5a5t/1652061750000/16694202268609887225/*/182hMFczb7vvOl5VzaXrZvPjWsCpiGUlr
Resolving doc-14-80-docs.googleusercontent.com (doc-14-80-docs.googleusercontent.com)... 142.251.120.132, 2607:f8b0:4001:c2e::84
Connecting to doc-14-80-docs.googleusercontent.com (doc-14-80-docs.googleusercontent.com)|142.251.120.1

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1845 entries, 0 to 1844
Data columns (total 96 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   Age                                              1845 non-null   float64
 1   Degree                                           1845 non-null   int64  
 2   Size of employer                                 1845 non-null   int64  
 3   Yearly salary > $100k                            1845 non-null   int64  
 4   Years of coding experience                       1845 non-null   int64  
 5   Years of experience in machine learning methods  1845 non-null   float64
 6   Experience with TPU                              1845 non-null   float64
 7   Regularly use Python                             1845 non-null   float64
 8   Regularly use R                                  1845 non-null   float64
 9   Regularly use SQL             

## **B.3 Build Deep Learning Models**

### (1) Splitting Data

We have now preprocessed our data. All columns are numeric and there are no missing values. We will use `Yearly salary > $100k` as the label column y and the remaining columns as features X. Run the following code to split the dataset into training and test sets, and split the columns into features and labels. Note that we hold out 20\% of the data as the test set. 

In [None]:
from sklearn.model_selection import train_test_split

features = df.loc[:,~df.columns.isin(['Yearly salary > $100k'])]
labels = df[['Yearly salary > $100k']]

X_train, X_test, y_train, y_test = train_test_split(features.values, labels.values, test_size=0.20, stratify=labels, random_state=42)

### (2) Network Structure of the Neural Network Model

We use Keras on top of Google TensorFlow to build our deep learning models. TensorFlow 2 is an end-to-end, open-source machine learning platform. Keras is the high-level API of TensorFlow 2. (Source: [link](https://keras.io/about/)) You can read more about TensorFlow at https://www.tensorflow.org/about and about Keras at https://keras.io/about/

Before we do anything else, let's import the modules we will need. We also define several global variables that we will need to use in ensuing tasks. 

In [None]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Input
from keras.layers import Dense
from sklearn.metrics import mean_squared_error
input_dim = 0
output_dim = 0
middle_dim = 0

 
In TensorFlow, we can manually add layers to a neural network and we can specify the loss function, the optimizer, and the optimizer hyperparameters. The code below creates a simple model with an input layer and a dense output layer. Note that at this moment we are not including any hidden layer or any activation function. Also note that the function `create_model()` has an argument `hidden_layer_dim` which is not used at all. It is included to facilitate Task 10 (in case you couldn't write the correct `create_model()` in Task 9).

In [None]:
def create_model(hidden_layer_dim=0):
  model = Sequential()
  model.add(Input(shape=(input_dim,)))
  model.add(Dense(output_dim))
  return model

input_dim = X_train.shape[1]
output_dim = 1
model = create_model()

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 1)                 96        
                                                                 
Total params: 96
Trainable params: 96
Non-trainable params: 0
_________________________________________________________________


The dimensionality of the input layer, i.e., the number of nodes of this layer, is equal to the number of features plus the bias, corresponding to the artificial, constant feature 1. From the output earlier, we know this dataset has 96 columns, including the label `Yearly salary > $100k`. Hence, there are 95 features. Therefore, after including the bias, the dimensionality of the input layer is 96. That is why the output above shows the number of parameters (i.e., weights) to the output layer as 96.
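The parameter count can be double-checked with a short calculation. The sketch below only redoes the arithmetic from the paragraph above; the variable names are illustrative.

```python
n_columns = 96               # columns in the preprocessed DataFrame, label included
n_features = n_columns - 1   # drop the label column
output_dim = 1               # a single output node

# A dense layer has one weight per (input, output) pair plus one bias per output node.
n_params = n_features * output_dim + output_dim
print(n_params)  # 96
```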

In a typical classification setup, the dimensionality of the output layer is equal to the number of classes (refer to the section of softmax/logistic regression in the [deep learning colab](https://colab.research.google.com/drive/1anftTmeq5cJ5Dzn0dYX0YBIIr0rPDUPw?usp=sharing) that we went through in lectures). However, given that we are doing binary classification in this task, we can use just one node for the output layer. This is because, if a model's prediction on one of the two classes is high, then its prediction on the other class is low. Specifically, if we use the sigmoid function as the activation on top of the output layer, the output is a probability $p$ for the positive class (i.e., the class corresponding to the target label `1`) and the probability for the negative class is $1-p$. 
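A minimal sketch of this relationship, using only the standard library. The raw output value `z` is a made-up number for illustration, not something produced by the model in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

z = 0.8                          # hypothetical raw output of the single output node
p_positive = sigmoid(z)          # probability of the positive class (label 1)
p_negative = 1.0 - p_positive    # probability of the negative class (label 0)
print(p_positive, p_negative)
```

Because the two probabilities always sum to 1, a second output node would carry no extra information in the binary case.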

### (3) Model Training: Loss Function, Optimizer, and Hyperparameters

After defining the network structure of the model, the following code chooses `mean_squared_error` as the loss function. There are many different loss functions supported by Keras, as listed at the [Keras API page](https://keras.io/api/losses/). Furthermore, the code chooses `SGD` as the optimizer. It is actually the minibatch stochastic gradient descent we explained in our lectures, as it accepts different batch sizes. The learning rate hyperparameter, 0.01, is passed as an argument when SGD is chosen as the optimizer. There are many different optimization algorithms apart from SGD, as listed at the [Keras API page](https://keras.io/api/optimizers/).  

Given the above definition of the model, we fit the model to the training data using the `fit()` method, with two arguments to specify hyperparameters batch size as 32 and number of epochs as 50. The dataset has 1845 examples. Given the 80\%/20\% training/test ratio, the training set has 1476 examples. Given the batch size 32, there are 47 batches. Each of the first 46 batches has 32 examples, and the last batch has the remaining 4 examples. 

In [None]:
optim = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(loss='mean_squared_error', optimizer=optim)
model.fit(X_train, y_train, batch_size=32, epochs=50, shuffle=True)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f042e0e9750>

The output is an epoch-by-epoch update of the training progress. For each epoch, it displays the epoch index number (e.g., `Epoch 8/50`), below which it displays a progress bar along with the batch index number (e.g., `47/47`). Given the small size of our dataset, we can't really see the progress bar and batch index number varying from batch 1 to batch 47. You can find many [Keras code examples](https://keras.io/examples/) that use much larger datasets, where the progress is clearly observable. Next to the progress bar, `0s` is the amount of time taken for the epoch, `2ms/step` is the time taken per batch, and `loss: 0.3028` is the loss on the training data after the epoch. We observe that the training loss steadily decreases over roughly the first 25-30 epochs, after which it fluctuates within a certain range and no longer improves. 
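The figure of 47 batches per epoch follows from simple arithmetic on the dataset size, the 80%/20% split, and the batch size. The sketch below reproduces it; the variable names are illustrative, and integer arithmetic is used so that the counts are exact.

```python
import math

n_examples = 1845
batch_size = 32

# 20% of the rows are held out as the test set.
n_test = n_examples * 20 // 100        # 369
n_train = n_examples - n_test          # 1476

# Every batch is full except possibly the last one.
n_batches = math.ceil(n_train / batch_size)
last_batch = n_train - (n_batches - 1) * batch_size
print(n_train, n_batches, last_batch)  # 1476 47 4
```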

### (3) Make Predictions Using the Trained Model

After training the model, the following code makes predictions using the `predict()` method and reports the model's performance in mean squared error (MSE). Note that the values in `y_pred` could be any numeric value and not just `0` or `1` as is the case with the true labels in `y_test`. 

In [None]:
y_pred = model.predict(X_test)

print('Mean Squared Error: ', mean_squared_error(y_test, y_pred))

Mean Squared Error:  0.22147876169692474
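
To illustrate why MSE is still well defined when the predictions are arbitrary real numbers, here is a minimal pure-Python sketch. The function name `mse()` and the four label/prediction pairs are made up for illustration; they are not taken from the dataset or from scikit-learn.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only: a linear output unit can produce values
# below 0 or above 1, even though the true labels are 0 or 1.
y_true_demo = [0, 1, 1, 0]
y_pred_demo = [-0.2, 0.7, 1.3, 0.4]

demo_error = mse(y_true_demo, y_pred_demo)
print("Mean Squared Error:", demo_error)
```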


## **Task 7: Activation function (5 points)**

Let's introduce non-linearity into the model. When adding a dense layer (including the output layer) using the `tf.keras.layers.Dense()` class, you can specify the activation on top of the dense layer as an argument `activation=CHOICE_OF_ACTIVATION` of `tf.keras.layers.Dense()`. Alternatively, you can add an activation layer above the dense layer using `tf.keras.layers.Activation()`. In both approaches, you can specify an activation function of your choice, e.g., sigmoid, relu, etc. You can find further documentation here: https://keras.io/api/layers/activations/

Since we are doing binary classification, we want the output to be a probability. Hence, the activation function for the output layer should be sigmoid, as we explained in lectures. 

With the sigmoid activation function on the output node, `y_pred` takes values within the range [0, 1], unlike the binary values (either 0 or 1) in `y_test`. To measure the model's accuracy in addition to the loss, we use the following simple function `accuracy()` to compute the classification labels `pred` (an output from the sigmoid function less than 0.5 means the example is classified as negative, otherwise positive) and then calculate classification accuracy. 
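
The thresholding rule described above can be sketched in plain Python as follows. The names `sigmoid()`, `to_labels()`, and `fraction_correct()` and the sample values are assumptions made for illustration; the notebook itself relies on Keras and scikit-learn for these steps.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_labels(probabilities, threshold=0.5):
    """Classify as positive (1) when the probability is at least the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def fraction_correct(y_true, labels):
    """Share of examples whose predicted label matches the true label."""
    return sum(t == l for t, l in zip(y_true, labels)) / len(y_true)

# Illustrative pre-activation outputs and true labels.
raw_outputs = [-2.0, 0.0, 3.0, -0.5]
true_labels = [0, 1, 1, 1]

probabilities = [sigmoid(z) for z in raw_outputs]
predicted = to_labels(probabilities)

print("predicted labels:", predicted)
print("accuracy:", fraction_correct(true_labels, predicted))
```

Note that a probability of exactly 0.5 is classified as positive, matching the "less than 0.5 means negative, otherwise positive" rule.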

We encapsulated the code for setting the dimensionality of various layers in method `set_dim()`. We also encapsulated the code for training the model, using the model to classify, and reporting accuracy in a new method `fit_predict()`. You only need to fill out the code within method `create_model()`. The argument of this function is `hidden_layer_dim`, which will be necessary for ensuing tasks. In this task, though, you can simply let `hidden_layer_dim` be passed in as an argument; you don't need to do anything with it inside `create_model()`. 

In [None]:
# DO NOT MAKE ANY CHANGE TO THIS CODE BLOCK

from sklearn.metrics import accuracy_score
import numpy as np

# Method for calculating classification accuracy. 
def accuracy(y_test, y_pred):
  pred = np.array([1 if value>=0.5 else 0 for value in y_pred])
  return accuracy_score(y_test, pred)

# Method for setting the dimensionality of various layers.
def set_dim():
  global input_dim, output_dim, middle_dim
  input_dim = X_train.shape[1]
  output_dim = 1
  middle_dim = 20

# Method for training models and making predictions using the models. 
def fit_predict(model):
  optim = tf.keras.optimizers.SGD(learning_rate=0.001)
  model.compile(loss='mean_absolute_error', optimizer=optim)
  model.fit(X_train, y_train, batch_size=300, epochs=15, verbose=0, shuffle=True)

  y_pred = model.predict(X_test)

  print('Mean Squared Error: ', mean_squared_error(y_test, y_pred))
  print('Accuracy: ', accuracy(y_test, y_pred))

In [None]:
# code for task 7

def create_model(hidden_layer_dim):
  model = Sequential()
  model.add(Input(shape=(input_dim,)))

  # The code you are adding should be inside the function create_model()
  # DO NOT CHANGE ANYTHING ABOVE. FILL IN YOUR CODE BELOW.

  model.add(Dense(output_dim, activation='sigmoid'))

  # DO NOT CHANGE ANYTHING BELOW. FILL IN YOUR CODE ABOVE.
  return model

set_dim()
model = create_model(hidden_layer_dim=middle_dim)
fit_predict(model)

Mean Squared Error:  0.13249322621997167
Accuracy:  0.8130081300813008


## **Task 8: Addition of hidden layer (5 points)**

Let's further change the model from task 7 by introducing a hidden layer with non-linearity. You can add a hidden layer to a model using `tf.keras.layers.Dense()`. Use 20 nodes for this hidden layer and activate the layer with the ReLU function. The value 20 is passed into `create_model()` through argument `hidden_layer_dim`. You can find further documentation about adding layers in Keras from the following page: https://keras.io/api/models/sequential/#add-method
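
To make the role of the hidden layer concrete, here is a minimal pure-Python sketch of one forward pass through a ReLU hidden layer followed by a sigmoid output node. It uses a tiny hypothetical network (2 inputs, 3 hidden nodes) with made-up weights rather than the 20-node Keras layer required in this task, and the helper `dense()` is our own illustrative name.

```python
import math

def relu(z):
    """ReLU passes positive values through and clips negatives to 0."""
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each node."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Made-up example: 2 input features, 3 hidden nodes, 1 output node.
x = [1.0, -2.0]
hidden_weights = [[0.5, 0.25], [-1.0, 0.5], [1.0, -1.0]]
hidden_biases = [0.0, 0.0, 0.0]
output_weights = [[1.0, 1.0, 0.5]]
output_biases = [-1.5]

hidden = dense(x, hidden_weights, hidden_biases, relu)
output = dense(hidden, output_weights, output_biases, sigmoid)

print("hidden activations:", hidden)
print("output probability:", output[0])
```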

In [None]:
# code for task 8 

def create_model(hidden_layer_dim):
  model = Sequential()
  model.add(Input(shape=(input_dim,)))

  # The code you are adding should be inside the function create_model()
  # DO NOT CHANGE ANYTHING ABOVE. FILL IN YOUR CODE BELOW.

  model.add(Dense(hidden_layer_dim, activation='relu'))
  model.add(Dense(output_dim, activation='sigmoid'))

  # DO NOT CHANGE ANYTHING BELOW. FILL IN YOUR CODE ABOVE.
  return model

set_dim()
model = create_model(hidden_layer_dim=middle_dim)
fit_predict(model)

Mean Squared Error:  0.23473588903766987
Accuracy:  0.7317073170731707


## **Task 9: Addition of L2 Regularization (5 points)**

We now introduce L2 regularization into the model from task 8. More specifically, let's add regularization to the hidden dense layer using the `kernel_regularizer` argument of `tf.keras.layers.Dense()`. Specify L2 as the regularizer, with a regularization factor (i.e., regularization rate) of 0.01. You can find further documentation here: https://keras.io/api/layers/regularizers/
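
As a rough illustration of what the L2 regularizer contributes, the sketch below computes the penalty term as the regularization factor times the sum of squared kernel weights, which is then added to the training loss. The function name `l2_penalty()` and the weight values are made up for illustration; in the actual model Keras computes this term for you, and only the layer's weights (not its biases) are penalized by `kernel_regularizer`.

```python
def l2_penalty(kernel, factor):
    """Regularization factor times the sum of squared kernel weights."""
    return factor * sum(w * w for row in kernel for w in row)

# Made-up 2x2 kernel; the real hidden layer has input_dim x 20 weights.
demo_kernel = [[0.5, -1.0],
               [2.0, 0.0]]

penalty = l2_penalty(demo_kernel, factor=0.01)
print("L2 penalty added to the loss:", penalty)
```

Larger weights are penalized quadratically, so the optimizer is nudged toward smaller weights during training.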

In [None]:
# code for task 9

def create_model(hidden_layer_dim):
  model = Sequential()
  model.add(Input(shape=(input_dim,)))

  # The code you are adding should be inside the function create_model()
  # DO NOT CHANGE ANYTHING ABOVE. FILL IN YOUR CODE BELOW.

  model.add(Dense(hidden_layer_dim, kernel_regularizer=tf.keras.regularizers.L2(0.01), activation='relu'))
  model.add(Dense(output_dim, activation='sigmoid'))

  # DO NOT CHANGE ANYTHING BELOW. FILL IN YOUR CODE ABOVE.
  return model

set_dim()
model = create_model(hidden_layer_dim=middle_dim)
fit_predict(model)

Mean Squared Error:  0.23944907975832677
Accuracy:  0.6558265582655827


## **Task 10: Model Selection by Grid Search with Cross Validation, and Model Evaluation (20 points)**

In this task, we use grid search with cross validation for model selection. Refer to our lectures and instructional colabs on model selection and evaluation to refresh your memory about these topics. The code of grid search is provided to you below. The hyperparameters that are being fine-tuned are specified in `param_grid`. The code also reports the best hyperparameter choices from the grid search, i.e., the hyperparameter values that lead to the best average accuracy on validation sets across the multiple iterations of cross validation. 

Note that, for the following grid search code to work, your method `create_model()` from Task 9 must work. If you can't solve Task 9, you can copy `create_model()` from Task 8 or 7 into Task 9 so that at least the grid search will run. If you can't solve any of the 3 tasks, at least copy `create_model()` from `(2) Network Structure of the Neural Network Model` under B.3 into Task 9. 
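
To get a sense of how much work the grid search performs, the following sketch enumerates every hyperparameter combination in the grid and counts the model fits. It copies the grid values from the provided code block for this task; the enumeration with `itertools.product` is only our illustration of what a grid search does conceptually, not scikit-learn's internal implementation.

```python
from itertools import product

# Same candidate values as in the grid search code for this task.
param_grid = {'hidden_layer_dim': [15, 25],
              'loss': ['mean_squared_error', 'binary_crossentropy'],
              'optimizer': ['sgd', 'adam'],
              'optimizer__learning_rate': [0.1, 0.01],
              'epochs': [20, 50],
              'batch_size': [150, 250]}

names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

cv_folds = 3  # cv=3 in the grid search
total_fits = len(combinations) * cv_folds  # refit=False, so no extra final fit

print("hyperparameter combinations:", len(combinations))
print("model fits during the search:", total_fits)
```

This count is why the training log printed by the search is long enough to be truncated.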

In [None]:
# DO NOT MAKE ANY CHANGE TO THIS CODE BLOCK

!pip install scikeras

from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV

set_dim()
model = KerasClassifier(model=create_model, hidden_layer_dim=middle_dim)

param_grid = {'hidden_layer_dim': [15, 25],
              'loss': ['mean_squared_error', 'binary_crossentropy'],
              'optimizer': ['sgd', 'adam'],
              'optimizer__learning_rate': [0.1, 0.01],
              'epochs' : [20, 50],
              'batch_size' : [150, 250]
              }

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit=False, cv=3, scoring='accuracy')
grid_result = gs.fit(X_train, y_train)

print("Best model: accuracy %f, using hyperparameters %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
  print("Accuracy %f (standard deviation %f), using hyperparameters %r" % (mean, stdev, param))

Streaming output truncated to the last 5000 lines.
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Ep


Given the chosen hyperparameters, your task is to write code to train a model on the original training set using the hyperparameters. The original training set is 80\% of the dataset after training/test split, which was further partitioned into training set + validation set during cross validation. 
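
The difference between the data seen during cross validation and the data used for the final model can be sketched as follows. The helper `kfold_sizes()` is a hypothetical name, and it assumes plain k-fold partitioning; scikit-learn may use stratified folds for classifiers, so the actual fold sizes can differ slightly.

```python
def kfold_sizes(num_examples, num_folds):
    """Return (fit_size, validation_size) for each fold of plain k-fold CV."""
    base, extra = divmod(num_examples, num_folds)
    sizes = []
    for fold in range(num_folds):
        # The first `extra` folds take one additional validation example.
        validation_size = base + (1 if fold < extra else 0)
        sizes.append((num_examples - validation_size, validation_size))
    return sizes

train_size = 1476  # the original training set (80% of 1845 examples)
fold_sizes = kfold_sizes(train_size, num_folds=3)

for fit_size, validation_size in fold_sizes:
    print("fit on", fit_size, "examples, validate on", validation_size)
print("final model is trained on all", train_size, "training examples")
```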

After the above required model is trained, evaluate the model using the test set. You should report the MSE and accuracy of the model, as `fit_predict()` from Task 7 did.

You should reuse the method `create_model()` from Task 9 since we are not changing the neural network architecture, activation function, or regularization. Hence, in Task 10 (the current task) you shouldn't define `create_model()` again. 

What you need to do is recreate `set_dim()` and `fit_predict()`, which were given to you in Task 7. Recreate them based on the best hyperparameters from the grid search. Given the rewritten `set_dim()` and `fit_predict()`, just call `set_dim()`, `create_model()`, and `fit_predict()`, in that order, as the code template below shows. 

In [None]:
# code for task 10

# Method for setting the dimensionality of various layers.
def set_dim():
  global input_dim, output_dim, middle_dim
  input_dim = X_train.shape[1]
  output_dim = 1
  middle_dim = 15

# Method for training models and making predictions using the models. 
def fit_predict(model):
  optim = tf.keras.optimizers.Adam(learning_rate=0.01)
  model.compile(loss='mean_squared_error', optimizer=optim)
  model.fit(X_train, y_train, batch_size=150, epochs=50, verbose=0, shuffle=True)

  y_pred = model.predict(X_test)

  print('Mean Squared Error: ', mean_squared_error(y_test, y_pred))
  print('Accuracy: ', accuracy(y_test, y_pred))

# DO NOT CHANGE ANYTHING BELOW. FILL IN YOUR CODE ABOVE.
set_dim()
model = create_model(hidden_layer_dim=middle_dim)
fit_predict(model)

Mean Squared Error:  0.16608638270656578
Accuracy:  0.7669376693766937
