# Columbus Data Science Meetup
# Regression With Python, Data Sets
## Charles Carter
### November 8, 2022

## Getting the data

Most of the time, we work with our own proprietary data. When learning regression with Python,
or when giving a public talk, we cannot use proprietary data. Fortunately, a number of
data sets are available directly through Python libraries, in addition to public data sets and those on
sites like Kaggle.

The Python library *statsmodels* provides a number of datasets. You can find these at:

https://www.statsmodels.org/dev/datasets/index.html
    
**R** has influenced the regression libraries in Python, and many **R** data sets can be imported
into Python. You can find these at:
    
https://github.com/vincentarelbundock/Rdatasets

Let's explore two data sets that you may have used: the Boston housing data and the credit card balance data from the *ISLR* package.

## Importing libraries

We need Pandas to manipulate the data and statsmodels to retrieve it.

In [1]:
import pandas as pd
import statsmodels.api as sm

## Housing data (Boston)

The Boston housing data is in the **R** package *MASS* (*Modern Applied Statistics with S*). We load the data
and print the dataset documentation.

In [2]:
boston = sm.datasets.get_rdataset("Boston", "MASS")
print(boston.__doc__)

.. container::

   Boston R Documentation

   .. rubric:: Housing Values in Suburbs of Boston
      :name: housing-values-in-suburbs-of-boston

   .. rubric:: Description
      :name: description

   The ``Boston`` data frame has 506 rows and 14 columns.

   .. rubric:: Usage
      :name: usage

   ::

      Boston

   .. rubric:: Format
      :name: format

   This data frame contains the following columns:

   ``crim``
      per capita crime rate by town.

   ``zn``
      proportion of residential land zoned for lots over 25,000 sq.ft.

   ``indus``
      proportion of non-retail business acres per town.

   ``chas``
      Charles River dummy variable (= 1 if tract bounds river; 0
      otherwise).

   ``nox``
      nitrogen oxides concentration (parts per 10 million).

   ``rm``
      average number of rooms per dwelling.

   ``age``
      proportion of owner-occupied units built prior to 1940.

   ``dis``
      weighted mean of distances to five Boston employment centres.

   ``rad

## Looking at the data

The *info()* method lists the fields with their non-null counts and data types. The *head()* method returns the first five rows of the data by default.

In [3]:
boston.data.info()
boston.data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  black    506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


## Using a Pandas DataFrame

The object returned by *get_rdataset()* is not itself a DataFrame, but its *data* attribute is
a Pandas DataFrame. We assign it to its own variable to take advantage of the Pandas functionality,
and then use the *describe()* method to examine summary statistics for each field.

In [4]:
bostondf = boston.data
bostondf.describe()

,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677082,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


## Saving the data

Let's save our DataFrame as a CSV file so we can load it from our local system.

In [5]:
import fnmatch
import os

# Write the DataFrame to the current working directory.
bostondf.to_csv("myBoston.csv")

# List only the CSV files rather than the whole directory.
fnmatch.filter(os.listdir('.'), '*.csv')

['employee_salaries.csv', 'ht-wt-sex-data.csv', 'myBoston.csv', 'myCredit.csv']
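
Because *to_csv()* writes the row index as an unnamed first column, reading the file back with default settings produces an extra `Unnamed: 0` column. Here is a small sketch of the round trip. It uses a two-row stand-in built from values in the *head()* output above and an in-memory buffer instead of a file on disk; both are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for bostondf: values copied from the head() output above.
sample = pd.DataFrame(
    {"crim": [0.00632, 0.02731], "rm": [6.575, 6.421], "medv": [24.0, 21.6]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer)

# Default read: the saved index comes back as an extra column.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive.columns.tolist())

# index_col=0 restores the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```

Alternatively, passing `index=False` to *to_csv()* omits the index from the file when it carries no information.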

## Credit data

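Let's do the same with the credit data, which needs a little more manipulation. Note that the description
printed below mentions ten thousand customers and 4 variables, while the data itself has 400 rows and
12 columns, as *info()* shows.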

In [6]:
credit = sm.datasets.get_rdataset("Credit", "ISLR")
print(credit.__doc__)

.. container::

   Credit R Documentation

   .. rubric:: Credit Card Balance Data
      :name: credit-card-balance-data

   .. rubric:: Description
      :name: description

   A simulated data set containing information on ten thousand
   customers. The aim here is to predict which customers will default on
   their credit card debt.

   .. rubric:: Usage
      :name: usage

   ::

      Credit

   .. rubric:: Format
      :name: format

   A data frame with 10000 observations on the following 4 variables.

   ``ID``
      Identification

   ``Income``
      Income in $1,000's

   ``Limit``
      Credit limit

   ``Rating``
      Credit rating

   ``Cards``
      Number of credit cards

   ``Age``
      Age in years

   ``Education``
      Number of years of education

   ``Gender``
      A factor with levels ``Male`` and ``Female``

   ``Student``
      A factor with levels ``No`` and ``Yes`` indicating whether the
      individual was a student

   ``Married``
      A factor with l

In [7]:
credit.data.info()
credit.data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         400 non-null    int64  
 1   Income     400 non-null    float64
 2   Limit      400 non-null    int64  
 3   Rating     400 non-null    int64  
 4   Cards      400 non-null    int64  
 5   Age        400 non-null    int64  
 6   Education  400 non-null    int64  
 7   Gender     400 non-null    object 
 8   Student    400 non-null    object 
 9   Married    400 non-null    object 
 10  Ethnicity  400 non-null    object 
 11  Balance    400 non-null    int64  
dtypes: float64(1), int64(7), object(4)
memory usage: 37.6+ KB


,ID,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
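
The *info()* output shows four `object` columns. When numeric columns are present, *describe()* summarizes only those by default, so the categorical fields need to be requested explicitly. Below is a brief sketch, using the five rows printed above as a stand-in for the full data set.

```python
import pandas as pd

# Stand-in for credit.data: the five rows shown by head().
sample = pd.DataFrame({
    "Income": [14.891, 106.025, 104.593, 148.924, 55.882],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Student": ["No", "Yes", "No", "No", "No"],
    "Ethnicity": ["Caucasian", "Asian", "Asian", "Asian", "Caucasian"],
})

# Excluding numeric columns leaves the categorical ones; the summary
# reports count, unique, top (most frequent value) and freq.
categorical_summary = sample.describe(exclude="number")
print(categorical_summary)

# value_counts() gives the full frequency table for one column.
ethnicity_counts = sample["Ethnicity"].value_counts()
print(ethnicity_counts)
```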


## Resetting the index

This data set includes both an *ID* field and a default row index, which is redundant. We keep
just one of them by making *ID* the index.

In [8]:
print("The credit data is not a DataFrame, but a", type(credit))
print("However, the DATA attribute is a DataFrame:", type(credit.data))
#print(type(credit.data))

creditdf = credit.data
creditdf.set_index('ID', inplace = True)
creditdf.head()

The credit data is not a DataFrame, but a <class 'statsmodels.datasets.utils.Dataset'>
However, the DATA attribute is a DataFrame: <class 'pandas.core.frame.DataFrame'>


ID,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
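
Note that *set_index()* with `inplace=True` also modifies *credit.data*, because *creditdf* is a second name for the same object. Where the original frame should stay untouched, a non-mutating variant can be used. This is a minimal sketch on a three-row stand-in taken from the output above.

```python
import pandas as pd

# Stand-in for credit.data: three rows from the head() output.
raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "Income": [14.891, 106.025, 104.593],
    "Balance": [333, 903, 580],
})

# Without inplace=True, set_index returns a new DataFrame and leaves `raw` alone.
# verify_integrity=True raises an error if the new index contains duplicates.
indexed = raw.set_index("ID", verify_integrity=True)

print(raw.columns.tolist())      # "ID" is still a column here
print(indexed.columns.tolist())  # "ID" has moved to the index
print(indexed.index.name)
```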


In [9]:
creditdf.to_csv("myCredit.csv")
fnmatch.filter(os.listdir('.'), '*.csv')

['employee_salaries.csv', 'ht-wt-sex-data.csv', 'myBoston.csv', 'myCredit.csv']