# Module 7 Assignment


A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and Restart & Run all. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
from nose.tools import assert_equal, assert_true


# Problem 1: Read in a dataset

For this problem you will read in a dataset from a **csv** file using Pandas. In the cell below, the function *read_data* takes an argument *file_path*, which contains the path to a dataset.
- Use the *read_csv* function from Pandas to read in the dataset from the file path and return the resulting DataFrame.
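
As a minimal sketch of how *read_csv* behaves, the snippet below reads a small, made-up CSV sample from an in-memory buffer instead of a file on disk. The variable names `csv_text` and `sample` and the three data rows are illustrative assumptions, not part of the assignment data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for a CSV file on disk.
csv_text = """sepal length,sepal width,iris type
5.6,2.9,Iris-versicolor
6.3,2.7,Iris-virginica
5.4,,Iris-setosa
"""

# read_csv accepts a file path or any file-like object;
# StringIO keeps this example self-contained.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column names taken from the header line
print(sample.isnull().sum())    # the empty field is read as a missing value
```

The empty field in the last row is parsed as a missing value (NaN), which is the kind of value Problem 3 asks you to drop.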

In [2]:
def read_data(file_path):
    '''
    Reads in a dataset using pandas.
    
    Parameters
    ----------
    file_path : string containing path to a file
    
    Returns
    -------
    Pandas DataFrame with data read in from the file path
    '''

    # YOUR CODE HERE
    
    df = pd.read_csv(file_path)
    return df

In [3]:
df = read_data('data/iris_data.csv')
df.head()

| | sepal length | sepal width | petal length | petal width | iris type |
|---|---|---|---|---|---|
| 0 | 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
| 1 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
| 2 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
| 3 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
| 4 | 5.0 | 3.5 | 1.3 | 0.3 | Iris-setosa |


In [4]:
assert_equal(len(df), 110, msg="The dataset should have 110 rows. Your solution only has %s"%len(df))
assert_equal(set(df.columns.tolist()), set(['sepal length', 'sepal width', 'petal length',
       'petal width', 'iris type']), 
             msg="Your column names do not match the solutions")

# Problem 2: Fix column names

In this problem you will fix the column names of the DataFrame loaded in Problem 1 so that no column name contains whitespace. Use '_' to connect words. For example, "sepal length" should become "sepal_length".

- Directly work on **df** created from Problem 1
- Fix all column names so that whitespaces are replaced by '_'
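
One possible way to do this without typing each name by hand is the string accessor on the column index. The sketch below uses a made-up one-row DataFrame called `demo` (an assumption for illustration); assigning an explicit list of names works just as well.

```python
import pandas as pd

# Hypothetical one-row frame whose column names contain spaces.
demo = pd.DataFrame({
    'sepal length': [5.6],
    'petal width': [1.3],
    'iris type': ['Iris-versicolor'],
})

# Replace every space in every column name with an underscore.
# regex=False treats ' ' as a literal character rather than a pattern.
demo.columns = demo.columns.str.replace(' ', '_', regex=False)

print(demo.columns.tolist())
```

This approach keeps working if columns are added or reordered, whereas a hard-coded list must match the number and order of columns exactly.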

In [5]:
# YOUR CODE HERE
df.columns = ['sepal_length','sepal_width','petal_length','petal_width','iris_type']

In [6]:
assert_true('sepal_length' in df.columns, "Column name is not fixed as directed")
assert_true('sepal_width' in df.columns, "Column name is not fixed as directed")
assert_true('petal_length' in df.columns, "Column name is not fixed as directed")
assert_true('petal_width' in df.columns, "Column name is not fixed as directed")
assert_true('iris_type' in df.columns, "Column name is not fixed as directed")
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | iris_type |
|---|---|---|---|---|---|
| 0 | 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
| 1 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
| 2 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
| 3 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
| 4 | 5.0 | 3.5 | 1.3 | 0.3 | Iris-setosa |


# Problem 3: Drop missing values

In this problem you will drop all rows with missing values from the DataFrame **df**.

- Directly work on **df** created from Problem 1 and fixed in Problem 2
- Drop all rows with missing values
- After this problem, there should be no missing values in **df**
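
The effect of dropping rows with missing values can be sketched on a small, made-up DataFrame. The names `demo` and `cleaned` and the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each of two different rows.
demo = pd.DataFrame({
    'sepal_length': [5.6, np.nan, 7.7, 5.4],
    'sepal_width': [2.9, 2.7, np.nan, 3.9],
    'iris_type': ['Iris-versicolor', 'Iris-virginica',
                  'Iris-virginica', 'Iris-setosa'],
})

# Count the missing values per column before cleaning.
print(demo.isnull().sum())

# dropna() removes every row that has at least one missing value.
# It returns a new DataFrame unless inplace=True is passed.
cleaned = demo.dropna()

print(len(cleaned))
print(cleaned.index.tolist())  # surviving rows keep their original labels
```

Because the surviving rows keep their original index labels, the index of a cleaned DataFrame can have gaps. If a contiguous index is needed, `reset_index(drop=True)` renumbers the rows.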

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  109 non-null    float64
 1   sepal_width   109 non-null    float64
 2   petal_length  109 non-null    float64
 3   petal_width   110 non-null    float64
 4   iris_type     110 non-null    object 
dtypes: float64(4), object(1)
memory usage: 4.4+ KB


In [8]:
# YOUR CODE HERE
df.dropna(inplace=True)

In [9]:
assert_equal(df.shape[0], 107, "df doesn't have correct values")
assert_equal(df.sepal_length.isnull().sum(), 0, "sepal_length column has missing values")
assert_equal(df.sepal_width.isnull().sum(), 0, "sepal_width column has missing values")
assert_equal(df.petal_length.isnull().sum(), 0, "petal_length column has missing values")
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 0 to 109
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  107 non-null    float64
 1   sepal_width   107 non-null    float64
 2   petal_length  107 non-null    float64
 3   petal_width   107 non-null    float64
 4   iris_type     107 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.0+ KB


# Problem 4: Create linear regression formula

In this problem you will create a statsmodels regression formula to predict petal_width from the other columns in the DataFrame **df** cleaned in Problem 3.

- Work with columns in **df** created from Problem 1 and fixed in Problems 2 & 3
- **petal_width** will be the dependent variable
- All other columns are independent variables
- Enclose the categorical feature in "C()"
- Assign the formula string to variable **formula**
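
The bullets above describe a plain string of the form `dependent ~ term + term + ...`. As a rough sketch, such a string can be assembled from lists of column names. The variable names `dependent`, `numeric`, `categorical`, and `example_formula` are assumptions chosen for illustration; writing the string out by hand is equally valid.

```python
# Column roles, taken from the problem statement.
dependent = 'petal_width'
numeric = ['sepal_length', 'sepal_width', 'petal_length']
categorical = ['iris_type']

# Wrap each categorical column in C() and join all terms with ' + '.
terms = numeric + ['C(%s)' % name for name in categorical]
example_formula = '%s ~ %s' % (dependent, ' + '.join(terms))

print(example_formula)
# petal_width ~ sepal_length + sepal_width + petal_length + C(iris_type)
```

The result is only a string. Fitting a model is a separate step, presumably done later by passing the string to `statsmodels.formula.api.ols` together with the DataFrame.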

In [10]:
# YOUR CODE HERE
# The autograder calls formula.split('~'), so formula must be a plain string,
# not a model object. The categorical column iris_type is wrapped in C().
formula = "petal_width ~ sepal_length + sepal_width + petal_length + C(iris_type)"

In [11]:
assert_true('petal_width' in formula.split('~')[0], "Dependent variable is wrong")
assert_true('sepal_length' in formula.split('~')[1], "sepal_length should be independent variable")
assert_true('sepal_width' in formula.split('~')[1], "sepal_width should be independent variable")
assert_true('petal_length' in formula.split('~')[1], "petal_length should be independent variable")
assert_true('C(iris_type)' in formula.split('~')[1], "Categorical variable is not enclosed in C()")