# Assignment 1
### Understanding Uncertainty
### Due 9/5

1. Create a new public repo on GitHub under your account. Include a README file.
2. Clone it to your machine. Put this file into that repo.
3. Use the following function to download the example data for the course:

In [2]:
def download_data(force=False):
    """Download and extract course data from Zenodo."""
    import urllib.request, zipfile, os
    
    zip_path = 'data.zip'
    data_dir = 'data'
    
    if not os.path.exists(zip_path) or force:
        print("Downloading course data")
        urllib.request.urlretrieve(
            'https://zenodo.org/records/16954427/files/data.zip?download=1',
            zip_path
        )
        print("Download complete")
    else:
        print("Download file already exists")
        
    if not os.path.exists(data_dir) or force:
        print("Extracting data files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Data extracted")
    else:
        print("Data directory already exists")

download_data()

Downloading course data
Download complete
Extracting data files...
Data extracted


4. Open one of the datasets using Pandas:
    1. `ames_prices.csv`: Housing characteristics and prices
    2. `college_completion.csv`: Public, nonprofit, and for-profit educational institutions, graduation rates, and financial aid
    3. `ForeignGifts_edu.csv`: Monetary and in-kind transfers from foreign entities to U.S. educational institutions
    4. `iowa.csv`: Liquor sales in Iowa, at the transaction level
    5. `metabric.csv`: Cancer patient and outcome data
    6. `mn_police_use_of_force.csv`: Records of physical altercations between Minnesota police and private citizens
    7. `nhanes_data_17_18.csv`: National Health and Nutrition Examination Survey
    8. `tuna.csv`: Yellowfin Tuna Genome (I don't recommend this one; it's just a sequence of G, C, A, T)
    9. `va_procurement.csv`: Public spending by the state of Virginia

In [None]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv('data/ames_prices.csv')
df.head()

Unnamed: 0,Order,PID,area,price,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Street,Alley,...,Screen.Porch,Pool.Area,Pool.QC,Fence,Misc.Feature,Misc.Val,Mo.Sold,Yr.Sold,Sale.Type,Sale.Condition
0,1,526301100,1656,215000,20,RL,141.0,31770,Pave,,...,0,0,,,,0,5,2010,WD,Normal
1,2,526350040,896,105000,20,RH,80.0,11622,Pave,,...,120,0,,MnPrv,,0,6,2010,WD,Normal
2,3,526351010,1329,172000,20,RL,81.0,14267,Pave,,...,0,0,,,Gar2,12500,6,2010,WD,Normal
3,4,526353030,2110,244000,20,RL,93.0,11160,Pave,,...,0,0,,,,0,4,2010,WD,Normal
4,5,527105010,1629,189900,60,RL,74.0,13830,Pave,,...,0,0,,MnPrv,,0,3,2010,WD,Normal


5. Pick two or three variables and briefly analyze them
    - Is it a categorical or numeric variable?
    - How many missing values are there? (`df['var'].isna()` and `np.sum()`)
    - If categorical, tabulate the values (`df['var'].value_counts()`) and if numeric, get a summary (`df['var'].describe()`)
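
The three checks in step 5 can be bundled into one helper. This is only a sketch: `summarize_variable` is a hypothetical name, and the `toy` frame is made-up data for illustration, not the course data. It decides between `describe()` and `value_counts()` from the stored dtype alone, so an integer-coded identifier would be summarized as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_variable(df, col):
    """Return the dtype, missing-value count, and a type-appropriate summary."""
    series = df[col]
    n_missing = int(series.isna().sum())
    if is_numeric_dtype(series):
        summary = series.describe()      # count, mean, std, quartiles
    else:
        summary = series.value_counts()  # frequency table (NaN excluded)
    return {"dtype": series.dtype, "n_missing": n_missing, "summary": summary}


# Made-up frame for illustration only (hypothetical values)
toy = pd.DataFrame({
    "Heating": ["GasA", "GasA", "GasW", None],
    "Lot.Frontage": [141.0, 80.0, None, 93.0],
})

heating_info = summarize_variable(toy, "Heating")
frontage_info = summarize_variable(toy, "Lot.Frontage")
print(heating_info["n_missing"], frontage_info["n_missing"])
```

On the real data the call would look like `summarize_variable(df, 'Heating')`.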

##### PID Variable

In [None]:
# Get Variable Type
df['PID'].dtype

dtype('int64')

`PID` is stored as an integer (`int64`), even though it is generally treated as an identifier rather than a quantity.

In [None]:
# Find the sum of missing values
np.sum(df['PID'].isna())

np.int64(0)

The `PID` variable has no missing values.

In [9]:
# Get Tabular Summary
df['PID'].describe()

count    2.930000e+03
mean     7.144645e+08
std      1.887308e+08
min      5.263011e+08
25%      5.284770e+08
50%      5.354536e+08
75%      9.071811e+08
max      1.007100e+09
Name: PID, dtype: float64

##### Heating Variable

In [11]:
# Get Variable Type
df['Heating'].dtype

dtype('O')

The object dtype (`'O'`) holds strings here, which means that `Heating` is a categorical variable.

In [12]:
# Find the sum of missing values
np.sum(df['Heating'].isna())

np.int64(0)

The `Heating` variable has no missing values.

In [13]:
df['Heating'].value_counts()

Heating
GasA     2885
GasW       27
Grav        9
Wall        6
OthW        2
Floor       1
Name: count, dtype: int64

##### Lot Frontage Variable

In [14]:
# Get Variable Type
df['Lot.Frontage'].dtype

dtype('float64')

The `Lot.Frontage` variable is numeric and stored as `float64`. The values display with a decimal point, and pandas by default stores a numeric column that contains missing values as floats.

In [15]:
# Find the sum of missing values
np.sum(df['Lot.Frontage'].isna())

np.int64(490)

This variable has 490 missing cases out of the 2930 rows. 
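
As a quick sanity check, the missing share can be worked out from the two counts above. The sketch below uses only the numbers already reported in this notebook; the variable names are just for illustration.

```python
# Counts taken from the notebook output above
n_missing = 490
n_rows = 2930

n_present = n_rows - n_missing        # non-missing observations
share_missing = n_missing / n_rows    # fraction of rows with no value

print(n_present)                      # prints 2440
print(f"{share_missing:.1%}")         # prints 16.7%
```

On the loaded data, `df['Lot.Frontage'].isna().mean()` computes the same fraction directly.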

In [17]:
# Get Tabular Summary
df['Lot.Frontage'].describe()

count    2440.000000
mean       69.224590
std        23.365335
min        21.000000
25%        58.000000
50%        68.000000
75%        80.000000
max       313.000000
Name: Lot.Frontage, dtype: float64

### Empirical CDF of the response variable
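
No code accompanies this heading in the notebook. A minimal sketch of one way to compute it follows, assuming `price` is the response variable (the notebook does not say so). The helper name `ecdf` is hypothetical, and the input is just the five prices shown in `df.head()` above.

```python
import numpy as np


def ecdf(values):
    """Return sorted values x and heights F(x) = (number of observations <= x) / n."""
    values = np.asarray(values, dtype=float)
    x = np.sort(values[~np.isnan(values)])    # drop missing values, then sort
    heights = np.arange(1, x.size + 1) / x.size
    return x, heights


# The five prices from df.head() above
x, heights = ecdf([215000, 105000, 172000, 244000, 189900])
for value, height in zip(x, heights):
    print(value, height)
```

With the full data this would be `ecdf(df['price'])`. For tied values, the height at the last of the ties is the one to read.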

6. What are some questions and prediction tools you could create using these data? Who would the stakeholder be for that prediction tool? What practical or ethical questions would it create? What other data would you want, that are not available in your data?

I think that one of the most intriguing questions you could answer using this data set would be: *What would be the estimated value of my home given its characteristics?*

To answer this question, we could build a prediction tool, such as a gradient boosting regression model, that estimates the value of a home from its condition, specifications, amenities, and so on. This tool would be especially helpful for stakeholders who are trying to sell their home and want a realistic, data-driven gauge of how much it should be valued at. It could also help buyers get a more objective perspective on the value of a house, based on the factors that the buyer actually cares about.
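
The gradient boosting idea can be sketched with scikit-learn. Everything here is an assumption made for illustration: the data are synthetic, the `area` and `quality` columns and the price formula are invented, and the model uses default settings. It is not a fitted model of the Ames data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic homes (hypothetical columns and price formula, not the course data)
rng = np.random.default_rng(0)
n = 300
homes = pd.DataFrame({
    "area": rng.uniform(500, 3000, n),
    "quality": rng.integers(1, 11, n),
})
homes["price"] = (
    50_000 + 80 * homes["area"] + 10_000 * homes["quality"]
    + rng.normal(0, 10_000, n)
)

# Hold out a quarter of the rows to check predictions on unseen homes
X_train, X_test, y_train, y_test = train_test_split(
    homes[["area", "quality"]], homes["price"], test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

predicted = model.predict(X_test)
r2_test = model.score(X_test, y_test)   # R^2 on the held-out rows
print(round(r2_test, 3))
```

Scoring on held-out rows rather than the training rows gives a more honest picture of how the tool would do on a home it has not seen.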

As for the practical and ethical concerns that this tool could raise, if the tool is used from the seller's perspective, we would need to ensure that it does not include any information the seller considers too personal or sensitive. We would also need to watch for potential multicollinearity between variables in order to reduce redundancy and underlying biases.
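
One simple first screen for multicollinearity is to look for feature pairs with a high pairwise correlation. The sketch below uses synthetic data with hypothetical column names, and the 0.8 cutoff is an arbitrary assumption. Pairwise correlation will also miss redundancy that involves three or more variables.

```python
import numpy as np
import pandas as pd

# Synthetic features (hypothetical names, not the course data)
rng = np.random.default_rng(1)
n = 200
area = rng.uniform(500, 3000, n)
features = pd.DataFrame({
    "area": area,
    "rooms": area / 250 + rng.normal(0, 0.5, n),   # built to track area closely
    "lot_frontage": rng.uniform(20, 150, n),       # generated independently
})

corr = features.corr()   # Pearson correlation matrix
threshold = 0.8          # assumed cutoff for "highly correlated"

# Collect each pair above the cutoff once (upper triangle only)
flagged = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(flagged)
```

Pairs that get flagged are candidates for dropping or combining before fitting a model.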

In addition to the neighborhood, I would also like a wider scope of the location: county, city, state, or even country. The scope would depend on how far apart these properties are from each other. If all of these neighborhoods were within a localized vicinity, then these extra variables would not provide any value or context to the overall analysis.

7. Commit your work to the repo (`git commit -am 'Finish assignment'` at the command line, or use the Git panel in VS Code). Push your work back to GitHub and submit the link on Canvas in the assignment tab.