## Python Data Science

> Introduction to Machine Learning

Kuo, Yao-Jen <yaojenkuo@datainpoint.com> from [DATAINPOINT](https://www.datainpoint.com)

In [1]:
import numpy as np
import pandas as pd
import sklearn.linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

## Given `house-prices-train.csv` in the working directory, extract `GrLivArea` as the feature matrix and `SalePrice` as the target vector. Split both into an 80% training set and a 20% validation set, passing `random_state=42` so the split is reproducible. Fit `LinearRegression` on the training set, predict `SalePrice` for the validation set, and return the mean squared error of those predictions.

- Expected inputs: a CSV file `house-prices-train.csv`.
- Expected outputs: a float.

In [2]:
def get_model_mse(csv_file):
    """
    >>> model_mse = get_model_mse('house-prices-train.csv')
    >>> print(type(model_mse))
    <class 'float'>
    >>> print(model_mse)
    3418946311.180807
    """
    ### BEGIN SOLUTION
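    df = pd.read_csv(csv_file)
    # scikit-learn expects a 2-D feature matrix, so reshape the single column to (n, 1)
    X = df['GrLivArea'].values.reshape(-1, 1)
    y = df['SalePrice'].values
    # Hold out 20% of the rows for validation; random_state fixes the shuffle
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Score on rows the model did not see during fitting
    y_hat = model.predict(X_valid)
    # float() guarantees a built-in float even if the metric returns a NumPy scalar
    return float(mean_squared_error(y_valid, y_hat))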
    ### END SOLUTION
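
The mean squared error returned above is the average of the squared residuals on the validation set. The following is a minimal sketch of that relationship, assuming synthetic data in place of `house-prices-train.csv`; the arrays `area` and `price` and the numbers used to generate them are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for GrLivArea and SalePrice (synthetic, not the real data)
rng = np.random.default_rng(42)
area = rng.uniform(500, 3000, size=200).reshape(-1, 1)
price = 100.0 * area.ravel() + 20000.0 + rng.normal(0, 5000, size=200)

# Same 80/20 split and fitting steps as in the exercise
X_train, X_valid, y_train, y_valid = train_test_split(
    area, price, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_hat = model.predict(X_valid)

# MSE by definition: the mean of the squared residuals
manual_mse = float(np.mean((y_valid - y_hat) ** 2))
sklearn_mse = float(mean_squared_error(y_valid, y_hat))
print(np.isclose(manual_mse, sklearn_mse))
```

Because the error is squared, the result is in squared target units, which is why the exercise's expected value for `SalePrice` is so large.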

## Given `titanic-train.csv` in the working directory, extract `Fare` as the feature matrix and `Survived` as the target vector. Split both into an 80% training set and a 20% validation set, passing `random_state=42` so the split is reproducible. Fit `LogisticRegression` on the training set, predict `Survived` for the validation set, and return the accuracy of those predictions.

- Expected inputs: a CSV file `titanic-train.csv`.
- Expected outputs: a float.

In [3]:
def get_model_accuracy_score(csv_file):
    """
    >>> model_accuracy_score = get_model_accuracy_score('titanic-train.csv')
    >>> print(type(model_accuracy_score))
    <class 'float'>
    >>> print(model_accuracy_score)
    0.6536312849162011
    """
    ### BEGIN SOLUTION
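    df = pd.read_csv(csv_file)
    # scikit-learn expects a 2-D feature matrix, so reshape the single column to (n, 1)
    X = df['Fare'].values.reshape(-1, 1)
    y = df['Survived'].values
    # Hold out 20% of the rows for validation; random_state fixes the shuffle
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # predict() returns class labels (0 or 1), not probabilities
    y_hat = model.predict(X_valid)
    # float() guarantees a built-in float even if the metric returns a NumPy scalar
    return float(accuracy_score(y_valid, y_hat))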
    ### END SOLUTION
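
Accuracy is the fraction of validation rows whose predicted label matches the true label. The following is a minimal sketch of that relationship, assuming synthetic data in place of `titanic-train.csv`; the arrays `fare` and `survived` and the rule that generates the labels are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for Fare and Survived (synthetic, not the real data)
rng = np.random.default_rng(42)
fare = rng.uniform(0, 100, size=300).reshape(-1, 1)
# Made-up rule: higher fares are more likely to be labeled 1
prob = 1.0 / (1.0 + np.exp(-(fare.ravel() - 50.0) / 10.0))
survived = (rng.uniform(size=300) < prob).astype(int)

# Same 80/20 split and fitting steps as in the exercise
X_train, X_valid, y_train, y_valid = train_test_split(
    fare, survived, test_size=0.2, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
y_hat = model.predict(X_valid)

# Accuracy by definition: the share of predictions that equal the true label
manual_accuracy = float(np.mean(y_hat == y_valid))
sklearn_accuracy = float(accuracy_score(y_valid, y_hat))
print(np.isclose(manual_accuracy, sklearn_accuracy))
```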

## Run tests!

Kernel -> Restart & Run All.

In [4]:
import unittest

class TestValidateModelPerformance(unittest.TestCase):
    def test_get_model_mse(self):
        model_mse = get_model_mse('house-prices-train.csv')
        self.assertIsInstance(model_mse, float)
        self.assertAlmostEqual(model_mse, 3418946311.180807)
    def test_get_model_accuracy_score(self):
        model_accuracy_score = get_model_accuracy_score('titanic-train.csv')
        self.assertIsInstance(model_accuracy_score, float)
        self.assertAlmostEqual(model_accuracy_score, 0.6536312849162011)

suite = unittest.TestLoader().loadTestsFromTestCase(TestValidateModelPerformance)
runner = unittest.TextTestRunner(verbosity=2)
test_results = runner.run(suite)
number_of_failures = len(test_results.failures)
number_of_errors = len(test_results.errors)
number_of_test_runs = test_results.testsRun
number_of_successes = number_of_test_runs - (number_of_failures + number_of_errors)
total_points = number_of_successes * 2

test_get_model_accuracy_score (__main__.TestValidateModelPerformance) ... ok
test_get_model_mse (__main__.TestValidateModelPerformance) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.057s

OK


In [5]:
print("You've got {} successes out of {} exercises.".format(number_of_successes, number_of_test_runs))

You've got 2 successes out of 2 exercises.
