## Course Information
INFO 521: Introduction to Machine Learning\
Instructor: Xuan Lu, College of Information Science

## Instructions
#### Objectives
This worksheet will assess your knowledge of basic commands in Python. Please review the lectures, suggested readings, and additional resources before starting the homework, as this document closely follows the provided materials.

#### Grading
Please note that grades are **NOT exclusively based on your final answers**. We will be grading the overall structure and logic of your code. Feel free to use as many lines as you need to answer each of the questions. I also highly recommend and strongly encourage adding comments (`#`) to your code. Comments will certainly improve the reproducibility and readability of your submission. Commenting your code is also good coding practice. **Specifically for the course, you’ll get better feedback if the TA is able to understand your code in detail.**

__Total score__: 100 points.

#### Submission
This homework is due on **Feb 3rd (Tuesday, 11:59 pm AZ time)**. Please contact the instructor if you are (i) having issues opening the assignment, (ii) not understanding the questions, or (iii) having issues submitting your assignment. Note that late submissions are subject to a penalty (see late work policies in the syllabus).
- Please submit a single Jupyter Notebook file (this file). Answers to each question should be included in the relevant block of code (see below). Rename your file to "**lastname_Hw2.ipynb**" before submitting. If a given block of code is causing issues and you didn't manage to fix it, please add comments.

#### Time commitment
Please reach out if you’re taking more than ~18h to complete (1) this homework, (2) reading the book chapters, and (3) going over the lectures. I will be happy to provide accommodations if necessary. **Do not wait until the last minute to start working on this homework**. In most cases, working under pressure will increase the time needed to answer these questions, and the instructor and the TA might not be fully available on Sundays to troubleshoot with you.

#### Looking for help?
First, please go over the relevant readings for this week. Second, if you’re still struggling with any of the questions, do some independent research (e.g., Stack Overflow is a wonderful resource). Don’t forget that your classmates will also be working on the same questions - reach out for help (check the Discussion forum for folks looking to interact with other students in this class, or start your own thread). Finally, the TA is available to answer any questions during office hours and via email.

## Questions
#### Author:
Name: [Your name]\
Affiliation: [Your affiliation]

### Conceptual

#### Question 1

For each of the four scenarios listed below, indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible method. Justify your answer.

- The sample size (number of observations; n) is extremely large, and the number of predictors (features; p) is small.

> **_Answer:_**  [BEGIN SOLUTION].

- The number of predictors (p) is extremely large, and the number of observations (n) is small.

> **_Answer:_**  [BEGIN SOLUTION].

- The relationship between the predictors and response is highly non-linear.

> **_Answer:_**  [BEGIN SOLUTION].

- The variance of the error terms (i.e. $\sigma^2 = \mathrm{Var}(\epsilon)$) is extremely high.

> **_Answer:_**  [BEGIN SOLUTION].

#### Question 2

In a few sentences, please answer the following questions to the best of your knowledge. Feel free to conduct additional research and cite your sources.

- Briefly explain the “curse of dimensionality”, provide a hypothetical example illustrating the concept, and list at least one potential way that is generally used to handle it when using machine learning models.

> **_Answer:_**  [BEGIN SOLUTION].

- Explain how dimensionality increases in each of the following cases and describe the consequences for learning:
    - Adding Features: Increasing the number of input features while keeping the model linear.
    - Polynomial Expansion: Increasing the polynomial degree of a model while keeping the original input features fixed.

> **_Answer:_**  [BEGIN SOLUTION].

- Explain why the relationship between model error and complexity differs when patterns are examined using training and test datasets.

> **_Answer:_**  [BEGIN SOLUTION].

- To the best of your knowledge, distinguish between training, test, and validation datasets. Briefly describe the importance of each in the context of Machine Learning.

> **_Answer:_**  [BEGIN SOLUTION].

- Briefly discuss the consequences of overfitting and underfitting.

> **_Answer:_**  [BEGIN SOLUTION].

#### Question 3

Explain whether each of the scenarios presented below is a classification or regression problem. Indicate whether the situation is mostly interested in conducting inference (explaining patterns) or prediction. Finally, indicate the number of observations (n) and features (p) associated with each of the scenarios.

- We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

> **_Answer:_**  [BEGIN SOLUTION].

- We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched by different companies. For each product we have recorded its type, whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

> **_Answer:_**  [BEGIN SOLUTION].

- We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for a given year. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

> **_Answer:_**  [BEGIN SOLUTION].

#### Question 4 

Let's now revisit the bias-variance decomposition.
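
For reference, one standard way of writing the decomposition for the expected test error at a single point is shown below. The symbols $x_0$, $y_0$, and $\hat{f}$ (a test input, its response, and the fitted model) are notation introduced here for illustration; please double-check it against the notation used in the lectures and the book.

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
$$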

- Provide a sketch of typical (squared) bias, variance, training error, test error, and irreducible error (sometimes called Bayes error) curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. __There should be five main curves. Make sure to label each one__. Now, add two arrows (parallel to the x-axis) indicating the direction of increase in over- and under-fitting, respectively. Finally, label the point, also on the x-axis, where model complexity is optimal. [_Note: You may draw your sketch on a piece of paper and then scan or photograph it. If you choose to do this, please submit the image file along with your homework. Feel free to find a way to insert the image into this file. :)_].

> **_Answer:_**  [BEGIN SOLUTION].


- Explain why it is conceptually important to acknowledge the existence of irreducible error in the context of Machine Learning.

> **_Answer:_**  [BEGIN SOLUTION].


- Finally, briefly explain why each of the __five curves__ has the shape displayed.

> **_Answer:_**  [BEGIN SOLUTION].

#### Question 5
You will now think of some real-life applications for statistical learning.

- Describe three real-life applications in which classification might be useful. Describe the response variable, as well as the potential predictors. Is the goal of each application inference or prediction? Explain your answer.

> **_Answer:_**  [BEGIN SOLUTION].

- Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

> **_Answer:_**  [BEGIN SOLUTION].

#### Question 6

Let’s now review some ethical considerations associated with the development and implementation of Machine Learning algorithms. Please answer each question with as much detail as possible (one or two short paragraphs per question should suffice). Feel free to conduct additional research if you think it’s necessary. I’m looking for well-supported arguments. Your grade won’t be based on whether the instructor and TA agree or disagree with your position. Instead, do your best to provide a thoughtful and clear answer.

- From your perspective, how unbiased must an algorithm be before it can be deployed in the “real world”? Is that level commonly achieved, discussed, or examined?

> **_Answer:_**  [BEGIN SOLUTION].


- Research labs and companies generally invest the most in improving and developing ML algorithms. Do you think that society in general should also have immediate and equal access to these developments and to their benefits? How would you balance these two goals (e.g. profit and well-being)? Similarly, do you think ML already affects (or will affect) inequalities?

> **_Answer:_**  [BEGIN SOLUTION].

- What do you see as a solution for problems associated with increasing automation and efficiency, both currently and in the future?

> **_Answer:_**  [BEGIN SOLUTION].


- We discussed different trade-offs between models in class – one of these was related to model interpretability. Why do you think it is important for machine learning models to be interpretable? Provide at least two reasons. Among the reasons that you listed, which one do you believe is the most important? Explain.

> **_Answer:_**  [BEGIN SOLUTION].

- Now, between model performance and model interpretability: which of these two qualities do you think is more important? Explain.

> **_Answer:_**  [BEGIN SOLUTION].


- Particular examples in history show that different algorithms and datasets were originally developed/simulated/compiled with goals related to increasing systemic inequalities. For instance, let’s take a quick look at the _iris_ dataset (__Fisher 1936__). First, how many hits do you get after searching for _iris_ dataset tutorial in Google (feel free to try any similar or more systematic search queries)? Based on this search, list at least two different modern uses of the dataset. Now, go over __Kozak & Łotocka (2013)__ and briefly talk about the biological realism of the dataset. Briefly comment on why you think _Annals of Eugenics_ showed interest in this dataset. Finally, given the widespread availability of data, do you think that the very frequent use of the _iris_ dataset in the field sends a particular message to society?

Kozak, M., & Łotocka, B. (2013). __What should we know about the famous Iris data__. Current Science, 104(5), 579-580.\
Fisher, R. A. (1936). __The use of multiple measurements in taxonomic problems__. Annals of Eugenics, 7(2), 179-188.

> **_Answer:_**  [BEGIN SOLUTION].

### Applied

#### Question 7

This exercise relates to the `College` data set, which can be found in the file `College.csv`. It contains a number of variables for 777 different universities and colleges in the US. The variables are:

- `Private` : Public/private indicator
- `Apps` : Number of applications received
- `Accept` : Number of applicants accepted
- `Enroll` : Number of new students enrolled
- `Top10perc` : New students from top 10 % of high school class
- `Top25perc` : New students from top 25 % of high school class
- `F.Undergrad` : Number of full-time undergraduates
- `P.Undergrad` : Number of part-time undergraduates
- `Outstate`  : Out-of-state tuition
- `Room.Board` : Room and board costs
- `Books` : Estimated book costs
- `Personal` : Estimated personal spending
- `PhD` : Percent of faculty with Ph.D.’s
- `Terminal` : Percent of faculty with terminal degree
- `S.F.Ratio` : Student/faculty ratio
- `perc.alumni` : Percent of alumni who donate
- `Expend` : Instructional expenditure per student
- `Grad.Rate` : Graduation rate


Use the `read_csv()` function in the *pandas* library to read the data. Make sure that your working directory is set to the location of the data. Note that if you’re not interested in downloading the dataset (which would actually make your submission more reproducible), you could simply read the dataset directly from the website using `read_csv()`. The dataset can be found at the following link: https://book.huihoo.com/introduction-to-statistical-learning/College.csv.

- Read dataset and save it as `college`. 

In [None]:
# BEGIN SOLUTION

# END SOLUTION
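
As a reminder of how `read_csv()` behaves, here is a minimal sketch on a tiny in-memory table. The three rows, the `Name` column, and the choice of `index_col=0` are made up for illustration and are not taken from `College.csv`; the same call accepts a local path or a URL in place of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file: three made-up rows using a few of the
# column names listed above. This is NOT the real College data.
csv_text = """Name,Private,Apps,Accept,Enroll,Top10perc
College A,Yes,1660,1232,721,23
College B,No,2186,1924,512,16
College C,Yes,1428,1097,336,60
"""

# read_csv accepts a local path, a URL, or any file-like object.
# index_col=0 uses the first column as the row index (an assumption here).
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(demo.shape)
print(demo.head())
```

Checking `shape` and `head()` right after loading is a quick way to confirm that the header and index were parsed the way you intended.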

- Use the `describe()` function to produce a numerical summary of the variables in the data set.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- Produce a scatterplot matrix of the first ten columns or variables of the data using the `pandas.plotting.scatter_matrix()` function or another suitable method.

In [None]:
# BEGIN SOLUTION

# END SOLUTION
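
The sketch below shows the general calling pattern of `pandas.plotting.scatter_matrix()` on synthetic data. The data frame `demo` and its column names are invented for illustration; with the real data you would slice the first ten columns instead of the first three.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic numeric data standing in for a real data frame (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

# iloc[:, :3] selects the first three columns by position.
# diagonal="hist" draws a histogram of each variable on the diagonal.
axes = scatter_matrix(demo.iloc[:, :3], figsize=(6, 6), diagonal="hist")

# One Axes object is returned per pair of plotted variables.
print(axes.shape)
```

Note that `scatter_matrix()` only plots numeric columns, so a text column in the selected slice will not show up as a panel.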

- Create side-by-side boxplots of `Outstate` versus `Private` using `matplotlib.pyplot` functions or another suitable method.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- Create a new binary variable, called `Elite`, by binning the `Top10perc` variable, and add it to `college`. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

In [None]:
# BEGIN SOLUTION

# END SOLUTION
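
One possible way to turn a numeric column into a two-level variable is `numpy.where()`, sketched below on made-up values. The labels `"Yes"`/`"No"` and the frame name `demo` are choices made for this illustration; `pandas.cut()` or a plain boolean column would work as well.

```python
import numpy as np
import pandas as pd

# Made-up percentages for illustration only.
demo = pd.DataFrame({"Top10perc": [23, 50, 51, 96]})

# Values strictly greater than 50 map to "Yes"; everything else maps to "No".
demo["Elite"] = np.where(demo["Top10perc"] > 50, "Yes", "No")

# value_counts() tallies how many rows fall in each group.
counts = demo["Elite"].value_counts()
print(counts)
```

Pay attention to the boundary: a value of exactly 50 does not "exceed 50%", so it falls in the `"No"` group here.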

- Use the `describe()` function or another suitable method to determine the number of elite universities. Then, create side-by-side boxplots of `Outstate` versus `Elite`.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- Produce some histograms with differing numbers of bins for a few of the quantitative variables. Use any functions you find suitable.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- Continue exploring the data, and provide a brief summary of what you discover. Some quick ideas: (1) explore graduation rates and briefly talk about what might be driving differences between universities. (2) Compare instructional expenditure per student and graduation rates.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- Are there any other features, not taken into account in the dataset, that you think might be important for understanding whether a school is classified as elite or not?

In [None]:
# BEGIN SOLUTION

# END SOLUTION

#### Question 8

This exercise involves the `auto` data, which can be found in the file `Auto.data`. Read the data and remove missing values. 

In [None]:
# BEGIN SOLUTION

# END SOLUTION
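
The sketch below illustrates one way to read a whitespace-delimited text file and drop incomplete rows. It assumes that columns are separated by runs of whitespace and that missing values are coded as `?`; both are assumptions, so inspect your copy of `Auto.data` first. The rows shown are made up.

```python
import io

import pandas as pd

# Made-up rows in a whitespace-delimited layout; "?" marks a missing value.
# Both the layout and the "?" marker are assumptions about the real file.
raw = """mpg cylinders horsepower name
18.0 8 130 "car a"
25.0 4 ? "car b"
21.0 6 100 "car c"
"""

# sep=r"\s+" splits on runs of whitespace; na_values turns "?" into NaN.
demo = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values="?")

# isna().sum() reports how many missing values each column has.
n_missing = demo.isna().sum()

# dropna() removes every row that contains at least one missing value.
demo_clean = demo.dropna()

print(n_missing)
print(demo_clean.shape)
```

Counting missing values before dropping them makes it easy to report how many observations were removed.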

- Which of the predictors are quantitative, and which are qualitative?
> **_Answer:_**  [BEGIN SOLUTION].

- What is the range of each quantitative predictor?

In [None]:
# BEGIN SOLUTION

# END SOLUTION
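
A small sketch of computing per-column ranges with pandas follows. The values and the frame name `demo` are invented; `mean()` and `std()` can be applied to the numeric columns in the same way.

```python
import pandas as pd

# Made-up values; "name" is included to show how text columns are excluded.
demo = pd.DataFrame(
    {
        "mpg": [18.0, 25.0, 21.0],
        "weight": [3504, 2200, 2900],
        "name": ["a", "b", "c"],
    }
)

# Keep only the numeric columns before computing summaries.
numeric = demo.select_dtypes(include="number")

# agg with a list of function names returns one row per statistic.
summary = numeric.agg(["min", "max"])

# The range as a single number per column: max minus min.
spread = numeric.max() - numeric.min()

print(summary)
print(spread)
```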

- What are the mean and standard deviation of each quantitative predictor?

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- Now remove the 10th __through__ 80th observations from `auto`. Save this as a new object called `auto2`. What are the range, mean, and standard deviation of each predictor in `auto2`?

In [None]:
# BEGIN SOLUTION

# END SOLUTION
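
Because Python uses 0-based positions and slices exclude their end point, removing a 1-based inclusive range of rows is an easy place to make an off-by-one error. The sketch below uses a made-up frame in which column `x` simply stores each row's 1-based observation number, so the result is easy to verify.

```python
import pandas as pd

# x holds the 1-based observation number of each row (illustration only).
demo = pd.DataFrame({"x": range(1, 101)})

# The 10th through 80th observations (1-based, inclusive) sit at 0-based
# positions 9 through 79, i.e. the slice 9:80 (end point excluded).
demo2 = demo.drop(demo.index[9:80])

print(len(demo2))
print(demo2["x"].head(12).tolist())
```

Using `demo.index[9:80]` selects rows by position, which stays correct even if the index labels are no longer consecutive after missing values have been dropped.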

- Using the full data set (`auto`), investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Briefly comment on your findings.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

> **_Answer:_**  [BEGIN SOLUTION].

- Suppose that we were interested in predicting gas mileage (`mpg`) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your answer.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

> **_Answer:_**  [BEGIN SOLUTION].

#### Question 9

- This exercise involves the `Boston` housing data set. To begin, load in the `Boston` data set from `Boston.csv`.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- How many rows are in this data set? How many columns? What do the rows and columns represent?

In [None]:
# BEGIN SOLUTION

# END SOLUTION

> **_Answer:_**  [BEGIN SOLUTION].

- Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings. Note that there’s a ton of things that you can comment on. Please select only one or two aspects that you think are relevant.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

> **_Answer:_**  [BEGIN SOLUTION].

- Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

> **_Answer:_**  [BEGIN SOLUTION].

- Do any of the census tracts of `Boston` appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

> **_Answer:_**  [BEGIN SOLUTION].

- How many of the census tracts in this data set bound the Charles river?

In [None]:
# BEGIN SOLUTION

# END SOLUTION
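
Counting rows that satisfy a condition is usually done by summing a boolean mask. The sketch below uses made-up rows, and the column names `chas` (river indicator) and `rm` (average rooms) are assumptions about how your copy of the data is labeled.

```python
import pandas as pd

# Made-up rows; column names are assumed, not read from Boston.csv.
demo = pd.DataFrame(
    {
        "chas": [0, 1, 0, 1, 1],
        "rm": [6.2, 7.4, 8.1, 5.9, 7.0],
    }
)

# A comparison yields a boolean Series; True counts as 1 when summed.
n_river = int((demo["chas"] == 1).sum())

# The same pattern works for any threshold.
n_over_seven = int((demo["rm"] > 7).sum())

print(n_river, n_over_seven)
```

The same mask can be used to subset the data, e.g. `demo[demo["rm"] > 7]`, when you want to inspect the matching rows rather than count them.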

- What is the median pupil-teacher ratio among the towns in this data set?

In [None]:
# BEGIN SOLUTION

# END SOLUTION

- Which census tract of `Boston` has the lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

In [None]:
# BEGIN SOLUTION

# END SOLUTION
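
One way to locate the row with the smallest value of a column, and to see it next to each column's overall range, is sketched below. The numbers are made up and the column names (`crim`, `ptratio`, `medv`) are assumptions about the file's labels.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "crim": [0.1, 38.4, 2.3],
        "ptratio": [15.3, 20.2, 18.0],
        "medv": [24.0, 5.0, 21.6],
    }
)

# idxmin() returns the index label of the row with the smallest medv.
lowest = demo["medv"].idxmin()
row = demo.loc[lowest]

# Place the selected row between each column's minimum and maximum.
comparison = pd.DataFrame(
    {"min": demo.min(), "tract": row, "max": demo.max()}
)

print(lowest)
print(comparison)
```

Note that `idxmin()` returns only the first matching label, so if several rows tie for the minimum you may want to filter with `demo[demo["medv"] == demo["medv"].min()]` instead.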

> **_Answer:_**  [BEGIN SOLUTION].

- In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

In [None]:
# BEGIN SOLUTION

# END SOLUTION

> **_Answer:_**  [BEGIN SOLUTION].

#### Question 10

Re-assessing the `Boston` dataset

- The `Boston` dataset is now part of the standard training sets implemented in many libraries and packages across different languages (e.g. `scikit-learn` ["recently" deprecated] and `TensorFlow` in `Python`; `MASS` in `R`). Did any of the questions (or your answers) listed above raise any red flags about the dataset?

> **_Answer:_**  [BEGIN SOLUTION].

- This dataset, originally published in _Harrison and Rubinfeld (1978)_, was primarily compiled to examine the effects of environmental factors in driving spatial patterns of the housing market. To date, different packages have implemented alternative versions of the features sampled in the dataset. Please compare the features (and their descriptions) in Table IV from the original study (_Harrison and Rubinfeld, 1978_) to the features used in the book’s dataset (the one we used in this homework). If the two datasets are not equivalent in their features, briefly (1) discuss the potential motives behind these discrepancies, and (2) comment on whether dropping particular columns was the right solution. Feel free to comment on the potential differences (conceptual and social) associated with training models on each of these two datasets.

Harrison Jr, D., & Rubinfeld, D. L. (1978). __Hedonic housing prices and the demand for clean air__. Journal of Environmental Economics and Management, 5(1), 81-102.

> **_Answer:_**  [BEGIN SOLUTION].