# (a) Linear Regression

We are given data used in a study of the homicide rate (HOM) in Detroit over the years 1961-1973. The data were collected by J.C. Fisher and used in his paper "Homicide in Detroit: The Role of Firearms," Criminology, vol. 14, pp. 387-400, 1976. Each row corresponds to a year, and each column holds the values of one variable.

![image](https://peilundai.com/ps2_programming/table.png)

It turns out that three of the variables together are good predictors of the homicide rate: `FTP`, `WE`, and one more variable.
Use methods described in Chapter 3 of the textbook to devise a mathematical formulation to determine the third variable. Implement your formulation and then conduct experiments to determine the third variable. In your report, be sure to provide the step-by-step mathematical formulation (citing Chapter 3 as needed) that corresponds to the implementation you turn in. Also give plots and a rigorous argument to justify the scheme you use and your conclusions.
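
One possible formulation is sketched here. It assumes each candidate model is an ordinary least-squares fit on FTP, WE, and one candidate variable $v$. The symbols $\Phi_v$, $\mathbf{w}_v$, and $\mathbf{t}$ are notation introduced for this sketch, not taken from the problem statement. Let $\mathbf{t}$ be the vector of the 13 yearly HOM values, and let $\Phi_v$ be the design matrix whose $n$-th row is $(\mathrm{FTP}_n, \mathrm{WE}_n, v_n)$, optionally with a leading 1 for a bias term. The sum-of-squares error and its minimizer are

$$E(\mathbf{w}_v) = \frac{1}{2}\sum_{n=1}^{13}\left(t_n - \mathbf{w}_v^{\top}\boldsymbol{\phi}_{v,n}\right)^2, \qquad \mathbf{w}_v^{\star} = \left(\Phi_v^{\top}\Phi_v\right)^{-1}\Phi_v^{\top}\mathbf{t},$$

where $\boldsymbol{\phi}_{v,n}$ is the $n$-th row of $\Phi_v$ written as a column vector. Under this sketch, the third variable is the $v$ that gives the smallest $E(\mathbf{w}_v^{\star})$. Because every candidate model has the same number of parameters, comparing their training errors directly is a like-for-like comparison.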

**Note**: the file `detroit.npy` containing the data is available in the resources section of our course Piazza. To load the data into Python, use the `X = numpy.load('detroit.npy')` command. Least-squares linear regression in Python can be done with the help of `numpy.linalg.lstsq()`.
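
As a minimal sketch of how `numpy.linalg.lstsq()` can be used to compare candidate variables, the example below runs on synthetic stand-in data, not the Detroit data. The helper `residual_sum_of_squares`, the arrays `base` and `candidates`, and the added bias column are all assumptions made for illustration.

```python
import numpy as np

def residual_sum_of_squares(A, t):
    """Fit t ~ A w by least squares and return the sum of squared residuals."""
    w, _, _, _ = np.linalg.lstsq(A, t, rcond=None)
    r = t - A @ w
    return float(r @ r)

# Synthetic stand-in data (NOT the Detroit data): 13 rows, 4 candidate columns.
rng = np.random.default_rng(0)
n = 13
base = rng.normal(size=(n, 2))        # plays the role of FTP and WE
candidates = rng.normal(size=(n, 4))  # plays the role of the other variables

# Build a target that truly depends on candidate column 2.
t = (1.5 * base[:, 0] - 0.7 * base[:, 1]
     + 2.0 * candidates[:, 2] + 0.01 * rng.normal(size=n))

ones = np.ones((n, 1))  # bias column (an assumption of this sketch)
rss = [
    residual_sum_of_squares(np.hstack([ones, base, candidates[:, [j]]]), t)
    for j in range(candidates.shape[1])
]
best = int(np.argmin(rss))
print("residuals per candidate:", rss)
print("selected candidate index:", best)
```

On real data, the same loop would run over the actual candidate columns, and the candidate with the smallest residual would be reported together with a plot of the residuals.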

**Your answer:**

Type your step-by-step mathematical formulation (citing Chapter 3 as needed).

In [1]:
# download data 
!wget https://peilundai.com/ps2_programming/detroit.npy

--2021-06-18 09:40:44--  https://peilundai.com/ps2_programming/detroit.npy
Resolving peilundai.com (peilundai.com)... 185.199.109.153, 185.199.108.153, 185.199.111.153, ...
Connecting to peilundai.com (peilundai.com)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1120 (1.1K) [application/octet-stream]
Saving to: ‘detroit.npy’


2021-06-18 09:40:44 (60.5 MB/s) - ‘detroit.npy’ saved [1120/1120]



In [2]:
# load data
import numpy as np
import pandas as pd
X=np.load('detroit.npy')
#print(X.shape)
#X

# Note: Least-squares linear regression in Python can be done with the help of np.linalg.lstsq()
## YOUR CODE

## This method uses least squares (the maximum-likelihood fit under Gaussian noise).

## First, I fit one linear regression per candidate variable. Each fit uses
## three inputs: FTP and WE, which are known to be predictors,
## plus one of the remaining variables.

## Second, I pick the regression whose predictions have the smallest
## squared difference from the actual output, B; its candidate variable is the answer.

array = X
# Split the data matrix into one list per column (i.e., per variable).
columns = [list(col) for col in array.T]

## Storing FTP and WE parameters as part of the A matrix.
## Also storing the output as its own column.
inputs = columns[:-1]
outputs = columns[-1]
B = np.array(outputs)
ftp = inputs[0]
we = inputs[-1]
vars = inputs[1:-1]

## Finding the weights of the linear regression that
## best match the selected inputs to the given outputs
keys = ["UEMP", "MAN", "LIC", "GR", "NMAN", "GOV", "HE"]
x_solved = []
for i in range(len(vars)):
  potential_var = vars[i]
  a = [] + [ftp] + [we] + [potential_var]
  A = np.transpose(np.array(a))
  # print(A.shape)
  # print(B.shape)
  x = np.linalg.lstsq(A, B, rcond=None)
  x_solved += [np.matmul(A, x[0])]

  ### Just some printing stuff for the console
  for ex in range(len(x)):
    #print(x[ex])
    if ex==0:
      #print(A.shape)
      #print(x[ex].shape)
      print("For weight x = " + str(x[ex]) + "......")
      print("A_("+ keys[i] + ") * x = " )
      print(np.matmul(A, x[ex]))
  print(x)
  print()

solutions = {keys[k] : x_solved[k] for k in range(len(keys))}
#print(solutions)

errors = []
for vector in solutions:
  x_possible = solutions[vector]
  sqrd_error = 0
  for i in range(len(x_possible)):
    sqrd_error += (B[i] - x_possible[i]) * (B[i] - x_possible[i])
  errors += [np.sqrt(sqrd_error)]
  #print(errors)
    
print("ROOT OF SQUARES SUM Dictionary: ")
final = [(sqrd_error, key) for (sqrd_error, key) in zip(errors, keys)]
print(final)

print()
print()
print()
print("ANSWER: ")
print(">>> " + min(final)[1])

### IRRELEVANT ###
#print(solutions)
# print("A: ")
# print(A)
# print(A.shape)
# print("B: ")
# print(B)
# print(B.shape)

# print(A)
# print(np.transpose(A))
# print(B)
# C = A * [1, 2, 3]
# print(ftp)
# print(we)
# print(vars)
# print(inputs)
# print(outputs)

For weight x = [-0.11350636  0.3526372   0.13353969]......
A_(UEMP) * x = 
[13.23958351 17.57120016 19.77777546 21.77477827 25.90482798 26.19461714
 24.78781917 13.38400913 27.20383756 25.068869   34.53810043 42.92210591
 47.55028416]
(array([-0.11350636,  0.3526372 ,  0.13353969]), array([1264.66427239]), 3, array([1274.9379536,   63.1669585,    7.3451018]))

For weight x = [ 0.09892971  0.23961867 -0.08129228]......
A_(MAN) * x = 
[16.80623303 19.76837788 19.71998944 18.90622175 18.43802741 14.60638538
 16.88156078 12.32860388 24.60127467 30.22188564 40.87383885 48.98622427
 50.90359583]
(array([ 0.09892971,  0.23961867, -0.08129228]), array([839.56672009]), 3, array([2377.99804714,  162.30134421,   52.90153266]))

For weight x = [-0.18235541  0.38458568  0.02948526]......
A_(LIC) * x = 
[ 2.84231846  6.95447318 10.71880501 13.68393286 20.6845382  24.33148384
 30.31297371 30.04780616 35.10767319 29.74785233 39.67124432 40.82325248
 49.01152318]
(array([-0.18235541,  0.38458568,  0.02

# (b) k-Nearest Neighbors

For this problem, you will implement the k-Nearest Neighbor (k-NN) classifier and evaluate it on the `Credit Approval` (CA) dataset, which describes credit worthiness data; in this case, the task is binary classification (see http://archive.ics.uci.edu/ml/datasets/Credit+Approval). We have split the available data into a training set `crx.data.training` and a testing set `crx.data.testing`. These are both comma-separated text files (CSVs).

The first step in working with the CA dataset is to process the data. In the data description `crx.names`, note that there are some missing values, that there are both numerical and categorical features, and that the dataset is relatively balanced (meaning a roughly equal number of positive and negative examples; this is not something you need to worry about here, but it is something to check in general). A great Python library for handling data like this is Pandas (https://pandas.pydata.org/pandas-docs/stable/). You can read in the data with `X = pandas.read_csv('crx.data.training', header=None, na_values='?')`. The last option tells Pandas to treat the character `?` as a missing value.

Pandas holds data in a "dataframe". We'll deal with individual rows and columns, which Pandas calls "series". Pandas contains many convenient tools, but the most basic one you'll use is `X.iloc[i,j]`, which accesses the element in the i-th row and j-th column. You can use this for both getting and setting values. You can also slice as in normal Python, grabbing the i-th row with `X.iloc[i,:]`.

You can view the first 20 rows with `X.head(20)`. The last column, number 15, contains the labels. You'll see some elements are missing, marked with `NaN`. While there are more sophisticated (and better) methods for imputing missing values, for this assignment we will just use mean/mode imputation. This means that for feature 0, you should replace all of the question marks with a `b`, as this is the mode, the most common value (regardless of whether you condition on the label). For real-valued features, just replace missing values with the label-conditioned mean (e.g. $\mu(x_1|+)$ for instances labeled as positive).
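
The mean/mode rule described above can be sketched as follows. This is an illustration on a tiny made-up dataframe, not the CA data; the function name `impute_sketch` and the column names `f0`, `f1`, and `label` are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_sketch(df, label_col):
    """Mode for categorical columns, label-conditioned mean for numeric columns."""
    out = df.copy()
    for col in out.columns:
        if col == label_col:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Mean of this feature within each label group, aligned to the rows.
            group_mean = out.groupby(label_col)[col].transform("mean")
            out[col] = out[col].fillna(group_mean)
        else:
            # Most common value of the column, ignoring missing entries.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Made-up example: one categorical feature, one numeric feature, one label.
toy = pd.DataFrame({
    "f0": ["b", "a", None, "b"],
    "f1": [1.0, np.nan, 3.0, 5.0],
    "label": ["+", "+", "+", "-"],
})
filled = impute_sketch(toy, label_col="label")
print(filled)
```

Here the missing `f0` entry becomes `b` (the mode), and the missing `f1` entry becomes 2.0, the mean of `f1` over the other rows labeled `+`.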

The second aspect one should consider is normalizing features. Nominal features can be left in their given form where we define the distance to be a constant value (e.g. 1) if they are different values, and 0 if they are the same. However, it is often wise to normalize real-valued features. For the purpose of this assignment, we will use $z$-scaling, where

$$z_{i}^{(m)} \leftarrow \frac{x_{i}^{(m)}-\mu_{i}}{\sigma_{i}}$$

such that $z_i^{(m)}$ indicates feature $i$ for instance $m$ (similarly, $x_i^{(m)}$ is the raw input), $\mu_i$ is the average value of feature $i$ over all instances, and $\sigma_i$ is the corresponding standard deviation over all instances.
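
A minimal sketch of $z$-scaling on made-up data follows. It makes two assumptions that the text above leaves open: $\mu_i$ and $\sigma_i$ are computed from the training set only and then reused for the test set, and $\sigma_i$ is the population standard deviation (`ddof=0`). The function name `z_scale_sketch` and the column names are hypothetical.

```python
import pandas as pd

def z_scale_sketch(train, test, numeric_cols):
    """Scale numeric columns of both frames using training-set statistics."""
    train, test = train.copy(), test.copy()
    for col in numeric_cols:
        mu = train[col].mean()
        sigma = train[col].std(ddof=0)  # population standard deviation (assumption)
        train[col] = (train[col] - mu) / sigma
        test[col] = (test[col] - mu) / sigma
    return train, test

# Made-up example: one numeric column "x" and one categorical column "c".
toy_train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "c": ["a", "b", "a"]})
toy_test = pd.DataFrame({"x": [2.0, 4.0], "c": ["b", "b"]})
tr, te = z_scale_sketch(toy_train, toy_test, numeric_cols=["x"])
print(tr)
print(te)
```

Reusing the training statistics keeps training and test features on the same scale, so distances between a test row and a training row stay comparable. Categorical columns are left unchanged.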

In this notebook, include the following functions:

i. A function `impute_missing_data()` that accepts two Pandas dataframes, one training and one testing, and returns two dataframes with missing values filled in. In your report include your exact methods for each type of feature. Note that you are free to impute the values using statistics over the entire dataset (training and testing combined) or just training, but please state your method.

ii. A function `normalize_features()` that accepts a training and testing dataframe and returns two dataframes with real-valued features normalized.

iii. A function `distance()` that accepts two rows of a dataframe and returns a float, the L2 distance: $D_{L2}(\mathbf{a},\mathbf{b}) = \sqrt{\sum_i (a_i - b_i)^2}$. Note that we define $D_{L2}$ to have a component-wise value of 1 for categorical attribute-values that disagree and 0 if they do agree (as previously implied). Remember not to use the label column in your distance calculation!

iv. A function `predict()` that accepts three arguments: a training dataframe, a testing dataframe, and an integer $k$ - the number of nearest neighbors to use in predicting. This function should return a column of $+/-$ labels, one for every row in the testing data.

v. A function `accuracy()` that accepts two columns, one true labels and one predicted by your algorithm, and returns a float between 0 and 1, the fraction of labels you guessed correctly.

In your report, include accuracy results on `crx.data.testing` for at least three different values of `k`.

vi. Try your algorithm on some other data! We’ve included the “lenses” dataset (https://archive.ics.uci.edu/ml/datasets/Lenses). It has no missing values and only categorical attributes, so no need for imputation or normalization. Include accuracy results from `lenses.testing` in your report as well. 
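
Items iii to v above can be sketched together on made-up data. This is an illustration under stated assumptions, not a reference solution: rows are plain tuples of the form `(features..., label)`, and the names `mixed_distance`, `knn_predict`, and `accuracy_sketch` are hypothetical. The label is sliced off before the distance is computed.

```python
from collections import Counter
import math

def mixed_distance(a, b):
    """L2 distance: squared difference for numbers, 0/1 mismatch for categories."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)

def knn_predict(train_rows, test_rows, k):
    """Majority vote over the k nearest training rows; labels are excluded from distances."""
    predictions = []
    for test in test_rows:
        neighbors = sorted(
            train_rows, key=lambda tr: mixed_distance(tr[:-1], test[:-1])
        )[:k]
        votes = Counter(row[-1] for row in neighbors)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

def accuracy_sketch(true_labels, predicted_labels):
    """Fraction of positions where the two label sequences agree."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up rows: (categorical feature, numeric feature, label).
toy_train = [("a", 0.0, "+"), ("a", 0.2, "+"), ("b", 3.0, "-"),
             ("b", 3.1, "-"), ("a", 2.9, "-")]
toy_test = [("a", 0.1, "+"), ("b", 3.2, "-")]
preds = knn_predict(toy_train, toy_test, k=3)
acc = accuracy_sketch([row[-1] for row in toy_test], preds)
print(preds, acc)
```

With two classes, an odd $k$ avoids tied votes; with more classes (as in the lenses data), ties can still occur and need an explicit tie-breaking rule.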

The code you submit must be your own. If you find or use information about specific algorithms from the Web or elsewhere, be sure to cite the source(s) clearly in your source code. You are not allowed to submit code downloaded from the internet.

In [3]:
!wget https://peilundai.com/ps2_programming/credit.zip
!unzip credit.zip

--2021-06-18 09:40:44--  https://peilundai.com/ps2_programming/credit.zip
Resolving peilundai.com (peilundai.com)... 185.199.109.153, 185.199.108.153, 185.199.111.153, ...
Connecting to peilundai.com (peilundai.com)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34183 (33K) [application/zip]
Saving to: ‘credit.zip’


2021-06-18 09:40:44 (51.4 MB/s) - ‘credit.zip’ saved [34183/34183]

Archive:  credit.zip
   creating: credit/
  inflating: __MACOSX/._credit       
  inflating: credit/lenses.output    
  inflating: __MACOSX/credit/._lenses.output  
  inflating: credit/lenses.testing   
  inflating: __MACOSX/credit/._lenses.testing  
  inflating: credit/crx.data.training  
  inflating: __MACOSX/credit/._crx.data.training  
  inflating: credit/lenses.training  
  inflating: __MACOSX/credit/._lenses.training  
  inflating: credit/crx.names        
  inflating: __MACOSX/credit/._crx.names  
  inflating: credit/crx.data.testing  
  inflating: __MACOSX

**Your answer**

In [4]:
### Your code for question (b), create more cells as needed.

import pandas as pd
import statistics as stat
from collections import Counter

training_set = pd.read_csv('credit/crx.data.training', header=None, na_values='?')

testing_set = pd.read_csv('credit/crx.data.testing', header=None, na_values='?')

training_set.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,,4.0,y,p,i,v,0.085,f,f,0,t,g,411.0,0,-
1,b,43.17,2.25,u,g,i,bb,0.75,t,f,0,f,g,560.0,0,-
2,a,71.58,0.0,,,,,0.0,f,f,0,f,p,,0,+
3,b,48.75,8.5,u,g,c,h,12.5,t,t,9,f,g,181.0,1655,+
4,a,38.33,4.415,u,g,c,v,0.125,f,f,0,f,g,160.0,0,-
5,b,39.92,5.0,u,g,i,bb,0.21,f,f,0,f,g,550.0,0,-
6,b,29.25,13.0,u,g,d,h,0.5,f,f,0,f,g,228.0,0,-
7,b,23.33,1.5,u,g,c,h,1.415,t,f,0,f,g,422.0,200,+
8,a,47.42,3.0,u,g,x,v,13.875,t,t,2,t,g,519.0,1704,+
9,a,21.92,11.665,u,g,k,h,0.085,f,f,0,f,g,320.0,5,-


In [5]:
def impute_missing_data(training_set, testing_set):
  trn_s = impute_features(training_set)
  tst_s = impute_features(testing_set)
  return (trn_s, tst_s)

def impute_features(input_dataset):
  # Fill every missing value with the mode (most common value) of its column.
  # Note: this applies mode imputation to numerical columns as well,
  # not the label-conditioned mean described in the problem statement.
  dataset = input_dataset.copy()
  for i in dataset:
    no_na_column = dataset[i].dropna().values
    mode = Counter(no_na_column).most_common(1)[0][0]
    # Filling the whole column at once avoids chained assignment
    # (and the SettingWithCopyWarning it triggers).
    dataset[i] = dataset[i].fillna(mode)
  return dataset

In [6]:
def normalize_features(train, test):
  trn = train.copy()
  tst = test.copy()
  return (z_scaling(trn), z_scaling(tst))

def z_scaling(dataset):
  just_numerical = dataset.select_dtypes(include='number')
  for column in just_numerical.columns:
    dataset[column] = (dataset[column] - dataset[column].mean()) / dataset[column].std()
  return dataset

In [7]:
def distance(row1, row2):
  # L2 distance over mixed features: squared difference for numerical
  # components, 0/1 mismatch for categorical components.
  d = 0
  for a, b in zip(row1, row2):
    try:
      c = a - b
      d += c * c
    except TypeError:
      # Non-numeric (categorical) values cannot be subtracted:
      # count 1 if they differ, 0 if they match.
      if str(a) != str(b):
        d += 1
  return np.sqrt(d)

In [8]:
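def predict(training_set, testing_set, k_nearest):
  # For each test row, find the k nearest training rows and take a majority vote.
  # NOTE: the full rows, including the label in the last column, are passed to
  # distance(), so the label column contributes to the distance here.
  predictions = []
  for i in range(int(testing_set.shape[0])):
    recorded_vals = []
    rowTest = testing_set.loc[i].values
    for j in range(int(training_set.shape[0])):
      rowTrain = training_set.loc[j].values
      dist = distance(rowTrain, rowTest)
      val = rowTrain[-1]
      recorded_vals += [(dist, val)]
    # Sort by (distance, label) and keep the labels of the k closest rows.
    topk_outcomes = [label for (dist, label) in sorted(recorded_vals)[:k_nearest]]
    labels = set(topk_outcomes)
    outcomes = []
    for label in labels:
      count = topk_outcomes.count(label)
      outcomes += [(count, label)]
    # Majority vote; ties in count go to the larger label.
    (_, outcome) = max(outcomes)
    predictions += [outcome]
  return predictions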

In [9]:
def compare(training_set, testing_set, k):
  predicted = predict(training_set, testing_set, k)
  actual = list(testing_set.iloc[:, -1].values)
  comparison = pd.DataFrame(list(zip(predicted, actual)))
  print(comparison.head())
  return (predicted, actual)

def accuracy(predicted, actual):
  correct = 0
  for i in range(len(predicted)):
    if predicted[i] == actual[i]:
      correct += 1
  return correct / len(predicted)

def run_test(trnS, tstS, k, treatment=False):
  if treatment:
    (trnS, tstS) = impute_missing_data(trnS, tstS)
    (trnS, tstS) = normalize_features(trnS, tstS)
  (pred, act) = compare(trnS, tstS, k)
  acc = accuracy(pred, act)
  print("For k = " + str(k) + ",")
  print("accuracy = " + str(acc))


### TESTING ###
print("([--{ CRX DATASET }--])")
run_test(trnS=training_set, tstS=testing_set, k=1, treatment=True)
print()
run_test(trnS=training_set, tstS=testing_set, k=3, treatment=True)
print()
run_test(trnS=training_set, tstS=testing_set, k=5, treatment=True)
print()
run_test(trnS=training_set, tstS=testing_set, k=7, treatment=True)
print()
run_test(trnS=training_set, tstS=testing_set, k=9, treatment=True)
print()
run_test(trnS=training_set, tstS=testing_set, k=11, treatment=True)

([--{ CRX DATASET }--])


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


   0  1
0  +  +
1  +  +
2  +  +
3  +  -
4  +  +
For k = 1,
accuracy = 0.9202898550724637





   0  1
0  +  +
1  +  +
2  +  +
3  +  -
4  +  +
For k = 3,
accuracy = 0.9347826086956522





   0  1
0  +  +
1  +  +
2  +  +
3  +  -
4  +  +
For k = 5,
accuracy = 0.9347826086956522





   0  1
0  +  +
1  +  +
2  +  +
3  -  -
4  +  +
For k = 7,
accuracy = 0.9492753623188406





   0  1
0  +  +
1  +  +
2  +  +
3  -  -
4  +  +
For k = 9,
accuracy = 0.9565217391304348





   0  1
0  +  +
1  +  +
2  +  +
3  -  -
4  +  +
For k = 11,
accuracy = 0.9492753623188406


In [10]:
lenses_training = pd.read_csv('credit/lenses.training', header=None)
lenses_testing = pd.read_csv('credit/lenses.testing', header=None)

### TESTING ###
print("-- LENSES DATASET --")
run_test(trnS=lenses_training, tstS=lenses_testing, k=1)
print()
run_test(trnS=lenses_training, tstS=lenses_testing, k=3)
print()
run_test(trnS=lenses_training, tstS=lenses_testing, k=5)
print()
run_test(trnS=lenses_training, tstS=lenses_testing, k=7)
print()
run_test(trnS=lenses_training, tstS=lenses_testing, k=9)

-- LENSES DATASET --
   0  1
0  3  3
1  1  1
2  3  3
3  2  2
4  3  3
For k = 1,
accuracy = 1.0

   0  1
0  3  3
1  1  1
2  3  3
3  2  2
4  3  3
For k = 3,
accuracy = 0.8333333333333334

   0  1
0  3  3
1  2  1
2  3  3
3  2  2
4  3  3
For k = 5,
accuracy = 0.6666666666666666

   0  1
0  3  3
1  2  1
2  3  3
3  2  2
4  3  3
For k = 7,
accuracy = 0.6666666666666666

   0  1
0  3  3
1  3  1
2  3  3
3  3  2
4  3  3
For k = 9,
accuracy = 0.5


In [11]:
#### RANDOM TESTING CODE (irrlev) ####


#(TrainS, TestS) = impute_missing_data(training_set, testing_set)

#TrainS.head(20)
#TestS.head(20)
#TrainS.dtypes

In [12]:
# t_temp = TrainS.copy()

# for c in t_just_numerical.columns:
#   t_temp[c] = 0 * t_just_numerical[c]

# t_temp.head(20)

In [13]:
# t_just_numerical = TrainS.select_dtypes(include='number')
# t_just_numerical

In [14]:
# (nmzd_training_set, nmzd_testing_set) = normalize_features(TrainS, TestS)

# nmzd_training_set.head(20)

In [15]:
# rows = nmzd_training_set.loc[:3]
# #print(type(rows))

# row1 = rows.loc[0].values
# #print(row1)
# row2 = rows.loc[1].values
# #print(row2)
# row3 = rows.loc[2].values
# row4 = rows.loc[3].values
# #print(row4[-1])

# #print(zip(row1, row2))

# #rows.head()

In [16]:
# print(type(55.0) == np.number)

# dist = distance(row1, row2)
# print(dist)

In [17]:
#crx_predicted = predict(nmzd_training_set, nmzd_testing_set, 7)\
#print(crx_predicted)

# k = 7 
# (predicted, actual) = compare(nmzd_training_set, nmzd_testing_set, k)

# final_accuracy = accuracy(predicted, actual)

# print()
# print("k : " + str(k) + " = " + str(final_accuracy))

# k = 9
# (predicted, actual) = compare(nmzd_training_set, nmzd_testing_set, k)

# final_accuracy2 = accuracy(predicted, actual)

# print()
# print("k : " + str(k) + " = " + str(final_accuracy2))

In [18]:
# print("lenses predictions: ")
# k = 5
# (pred, act) = compare(lenses_training, lenses_testing, k)
# print("@ k = " + str(k) + ", accuracy = " + str(accuracy(pred, act)))

In [19]:
# tedst = [(5, '5'), (4, '4'), (3, '3'), (1, '1'), (2, '2'), (2, '2')]

# print(tedst)

# print(tedst[:4])

#print(nmzd_testing_set.head(10))
# print(sorted(set(tedst)))

# nums = [num for (num, number) in set(tedst)]

# print(nums)

# print(max(tedst))

#print(nmzd_testing_set.iloc[:, -1])