# Practice Loading and Exploring Datasets

This assignment is purposely open-ended. You will be asked to load datasets from the [UC-Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). 

Even though you may be using different datasets than your classmates, try to be supportive and assist each other with the challenges that you are facing. You will only deepen your understanding of these topics as you work to assist one another. Many popular UCI datasets face similar data loading challenges.

Remember that the UCI datasets do not necessarily have a file type of `.csv`, so it's important that you learn as much as you can about the dataset before you try to load it. See if you can look at the raw text of the file, either locally, using the `!curl` shell command, or in some other way, before you try to read it in as a dataframe. This will help you catch what would otherwise be unforeseen problems.
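As a rough sketch of what "looking at the raw text first" can mean, the helper below prints the first few lines of a text blob and guesses the delimiter by counting candidate characters in the first line. The name `peek`, the candidate list, and the sample string (two rows abbreviated from a comma-separated file) are assumptions made for illustration; in practice the text would come from `!curl <url>` or `requests.get(url).text`.

```python
def peek(raw_text, n_lines=5, candidates=(",", ";", "\t", "|")):
    """Return the first n_lines of raw_text and the candidate delimiter
    that occurs most often in the first line (None if none occur)."""
    lines = raw_text.splitlines()[:n_lines]
    first = lines[0] if lines else ""
    best = max(candidates, key=first.count)
    return lines, (best if first.count(best) > 0 else None)


# Stand-in for the raw text of a data file (hypothetical, abbreviated rows).
sample = "adviser,32/60,125,256\namdahl,470v/7,29,8000\n"

lines, delimiter = peek(sample)
for line in lines:
    print(line)
print("Likely delimiter:", repr(delimiter))
```

This is only a heuristic: a file with quoted fields or a free-text header could fool it, so treat the guess as a hint and still read the dataset's documentation.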

Feel free to embellish this notebook with additional markdown cells, code cells, comments, graphs, etc. Add whatever you think helps adequately address the questions.

## 1) Load a dataset from UCI (via its URL)

Please navigate to the home page and choose a dataset (other than the Adult dataset) from the "Most Popular" section on the right-hand side of the home page. Load the dataset via its URL and check the following (show your work):

- Are the headers showing up properly?
- Look at the first 5 and the last 5 rows. Do they seem to be in order?
- Does the dataset have the same number of rows and columns as described on the UCI page?
  - Remember that UCI does not count the y variable (the column of values that we might want to predict via a machine learning model) as an "attribute" but rather as a "class attribute", so you may end up seeing one more column than the number listed on the UCI website.
- Does UCI list this dataset as having missing values? Check for missing values and see whether your analysis corroborates what UCI reports.
- If `NaN` values or other missing-value indicators are not being detected by `df.isnull().sum()`, find a way to replace whatever is indicating the missing values with `np.nan`.
- Use the `.describe()` function to see the summary statistics of both the numeric and non-numeric columns.
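To illustrate the missing-value bullet, here is a minimal sketch that assumes the file marks missing entries with `?`. The tiny CSV text and its column names are made up for the example; your dataset may use a different marker, so check its documentation.

```python
import io

import numpy as np
import pandas as pd

# Made-up CSV text in which "?" stands for a missing value.
raw = "age,job,hours\n39,clerk,40\n50,?,13\n38,farmer,?\n"

df = pd.read_csv(io.StringIO(raw))
print(df.isnull().sum())  # "?" is read as an ordinary string, so nothing is flagged

# Option 1: replace the marker after loading.
df_clean = df.replace("?", np.nan)
print(df_clean.isnull().sum())

# Option 2: tell read_csv about the marker up front, which also lets
# pandas parse the affected numeric column as numbers.
df_direct = pd.read_csv(io.StringIO(raw), na_values="?")
print(df_direct.dtypes)
```

Option 2 is usually more convenient because a column such as `hours` is parsed as numeric straight away instead of staying a column of strings.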

In [0]:
import pandas as pd
import requests
import numpy as np
import re

In [0]:
# Loading in Relative CPU Performance Data from the UCI Repository
cpu_performance_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data', header=None)

In [0]:
# Check the shape of the dataset to make sure it matches what is stated on the UCI Repository.
cpu_performance_df.shape
# At first the shape did not match because the first row was being used as the header.
# Passing header=None, as in "pd.read_csv(url, header=None)", loads the dataset with the correct number of instances.

In [0]:
print('First 5 Rows: \n',cpu_performance_df.head(),'\n\n') # Show first 5 rows of DataFrame
print('Last 5 Rows: \n',cpu_performance_df.tail()) # Show last 5 rows of DataFrame

In [64]:
# The columns still have no names, so I loaded the attribute information for the dataset
# from the UCI Repository to see if I could find the attribute names.
names = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.names')
print(names.text)

1. Title: Relative CPU Performance Data 

2. Source Information
   -- Creators: Phillip Ein-Dor and Jacob Feldmesser
     -- Ein-Dor: Faculty of Management; Tel Aviv University; Ramat-Aviv; 
        Tel Aviv, 69978; Israel
   -- Donor: David W. Aha (aha@ics.uci.edu) (714) 856-8779   
   -- Date: October, 1987
 
3. Past Usage:
    1. Ein-Dor and Feldmesser (CACM 4/87, pp 308-317)
       -- Results: 
          -- linear regression prediction of relative cpu performance
          -- Recorded 34% average deviation from actual values 
    2. Kibler,D. & Aha,D. (1988).  Instance-Based Prediction of
       Real-Valued Attributes.  In Proceedings of the CSCSI (Canadian
       AI) Conference.
       -- Results:
          -- instance-based prediction of relative cpu performance
          -- similar results; no transformations required
    - Predicted attribute: cpu relative performance (numeric)

4. Relevant Information:
   -- The estimated relative performance values were estimated by the autho

In [27]:
# Using the "df.columns" attribute, I named all of my columns after the attributes listed in the .names file from the UCI Repository:
# 'https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.names'
cpu_performance_df.columns = ['Vendor', 'Model', 'MCT/ns', 'MIN_MEM/kb', 'MAX_MEM/kb', 'CACHE_MEM/kb', 'CHANNEL_MIN', 'CHANNEL_MAX', 'Published_RP', 'Estimated_RP']
# Now printing the head of the DataFrame to make sure the changes took place.
cpu_performance_df.head()

Unnamed: 0,Vendor,Model,MCT/ns,MIN_MEM/kb,MAX_MEM/kb,CACHE_MEM/kb,CHANNEL_MIN,CHANNEL_MAX,Published_RP,Estimated_RP
0,adviser,32/60,125,256,6000,256,16,128,198,199
1,amdahl,470v/7,29,8000,32000,32,8,32,269,253
2,amdahl,470v/7a,29,8000,32000,32,8,32,220,253
3,amdahl,470v/7b,29,8000,32000,32,8,32,172,253
4,amdahl,470v/7c,29,8000,16000,32,8,16,132,132


In [0]:
# Now time to check if there are any missing values.
print('Check for Null values: \n\n',cpu_performance_df.isnull().sum(),'\n\n')
# After running this we can see that there are no null values in this DataFrame WOOHOO!!

In [0]:
# But let's look further:
# Using a for loop to check the values counts of each Column to see if any values look
# out of place.
for i in cpu_performance_df.columns:
  print('Attribute:',i,':Value Counts: \n')
  print(cpu_performance_df[i].value_counts(),'\n\n')
# After running this it appeared all values were accounted for.
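Reading every `value_counts()` printout by eye gets tedious on wider datasets. As a hedged alternative, the sketch below scans the non-numeric columns for tokens that commonly stand in for missing values. The token list and the helper name `find_placeholders` are assumptions, not an official list, and the toy frame is made up.

```python
import pandas as pd

# Hypothetical list of tokens that often stand in for missing values.
SUSPECT_TOKENS = {"?", "na", "n/a", "-", "unknown", "none", "null"}


def find_placeholders(df, suspects=SUSPECT_TOKENS):
    """Map each non-numeric column to the suspect tokens it contains and their counts."""
    hits = {}
    for col in df.select_dtypes(exclude="number").columns:
        # Normalize whitespace and case before comparing against the token list.
        values = df[col].astype(str).str.strip().str.lower()
        found = values[values.isin(suspects)]
        if not found.empty:
            hits[col] = found.value_counts().to_dict()
    return hits


# Made-up frame: two suspicious entries hidden in a text column.
toy = pd.DataFrame({
    "Vendor": ["ibm", "?", "amdahl", " N/A "],
    "MCT/ns": [125, 29, 29, 26],
})
print(find_placeholders(toy))
```

Numeric sentinels such as `-1` or `999` are not covered by this check; those still need a look at `describe()` or the dataset's documentation.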

In [0]:
# Using the "describe()" function, I will look at some more in depth info about the DataFrame
cpu_performance_df.describe()

In [66]:
# Using the "describe()" function with exclude='number' to see info about non-number values.
cpu_performance_df.describe(exclude='number')

Unnamed: 0,Vendor,Model
count,209,209
unique,30,209
top,ibm,4443
freq,32,1


## 2) Load a dataset from your local machine.
Choose a second dataset from the "Popular Datasets" listing on UCI, but this time download it to your local machine instead of reading it in via the URL. Upload the file to Google Colab using the files tab in the left-hand sidebar or by importing `files` from `google.colab`. The following link will be a useful resource if you can't remember the syntax: <https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92>

- Answer all of the same bullet point questions from part 1 again on this new dataset. 
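If you go the `files.upload()` route, the call returns a dictionary that maps each uploaded file name to its contents as bytes, so a file can also be read straight from memory. The sketch below fakes that dictionary so it runs outside Colab; the file name, the made-up rows, and the semicolon separator are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in for the result of `files.upload()` in Colab (made-up contents).
uploaded = {"student-mat.csv": b"school;sex;age\nGP;F;18\nGP;M;17\n"}

# Wrap the bytes in a file-like object and pass the separator explicitly.
df_uploaded = pd.read_csv(io.BytesIO(uploaded["student-mat.csv"]), sep=";")
print(df_uploaded.head())
print(df_uploaded.shape)
```

Reading from the dictionary avoids depending on the current working directory, at the cost of keeping the raw bytes in memory.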


In [8]:
# Using google.colab.files to upload files from my local drive.
from google.colab import files
uploaded = files.upload()

Saving student.txt to student.txt
Saving student-mat.csv to student-mat.csv
Saving student-por.csv to student-por.csv


In [9]:
# Checking to see if the uploaded files are in my current directory using bash.
!pwd
!ls
# Create a new Directory named 'Student_Data' using bash.
!mkdir Student_Data/
# Move the desired Files to the new Directory using bash.
!mv student-mat.csv student-por.csv student.txt Student_Data/
# Moving to 'Student_Data' Directory
%cd Student_Data
# Checking for the files that were just moved.
!ls

/content
sample_data  student-mat.csv  student-por.csv  student.txt
/content/Student_Data
student-mat.csv  student-por.csv  student.txt


In [0]:
# Opening the txt file associated with the datasets to view additional
# info about them. The `with` block closes the file automatically.
with open('student.txt', 'r') as txt_file:
    txt_string = txt_file.read().replace('"', ' ')


In [0]:
print(txt_string)

In [0]:
# Closing the txt file after reading it.
# txt_file.close()

In [0]:
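# Now to read in the csv files that are in 'Student_Data/'.
# The UCI student files separate fields with semicolons, so pass sep=';'.
# With the default comma separator each row would load as a single column.
df_mat = pd.read_csv('student-mat.csv', sep=';')
df_por = pd.read_csv('student-por.csv', sep=';')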

In [0]:
# Since I've recently become interested in regular expressions, I decided to put them
# to use here to extract the attribute names for my dataset from the text file.
pattern = re.compile(r'\n\d{1,2}\s([a-zA-Z]*\d?)')
# findall already returns a list of the captured names, so no loop is needed.
match_list = pattern.findall(txt_string)

In [16]:
match_list 

['school',
 'sex',
 'age',
 'address',
 'famsize',
 'Pstatus',
 'Medu',
 'Fedu',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'traveltime',
 'studytime',
 'failures',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic',
 'famrel',
 'freetime',
 'goout',
 'Dalc',
 'Walc',
 'health',
 'absences',
 'G1',
 'G2',
 'G3']

## 3) Make Crosstabs of the Categorical Variables

Take whichever of the above datasets has more categorical variables and use crosstabs to tabulate the different instances of the categorical variables.
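As a starting point, here is a minimal sketch of `pd.crosstab` on two categorical columns. The column names echo the student dataset, but the six rows are made up so the example runs on its own.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS", "GP"],
    "internet": ["yes", "no", "yes", "yes", "no", "yes"],
})

# Raw counts, with row and column totals added by margins=True.
counts = pd.crosstab(toy["school"], toy["internet"], margins=True)
print(counts)

# Row-wise proportions: each row sums to 1.
shares = pd.crosstab(toy["school"], toy["internet"], normalize="index")
print(shares.round(2))
```

Counts answer "how many", while the normalized version makes groups of different sizes comparable.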


In [0]:
# Your Code Here

## 4) Explore the distributions of the variables of the dataset using:
- Histograms
- Scatterplots
- Density Plots
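One possible way to draw all three plot types is with the pandas plotting methods, sketched below on randomly generated toy data (not one of the UCI datasets). The density plot relies on SciPy being installed, so the sketch falls back to a note when it is missing.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: y depends roughly linearly on x, plus noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(50, 10, 200)})
toy["y"] = 2 * toy["x"] + rng.normal(0, 5, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
toy["x"].plot.hist(bins=20, ax=axes[0], title="Histogram of x")
toy.plot.scatter(x="x", y="y", ax=axes[1], title="x vs. y")
try:
    toy["x"].plot.density(ax=axes[2], title="Density of x")
except ImportError:
    # pandas needs SciPy to estimate the density curve.
    axes[2].set_title("Density plot needs SciPy")
fig.tight_layout()
```

In a notebook the figure renders automatically at the end of the cell. On your own dataset, swap `toy["x"]` and `toy["y"]` for numeric columns of interest.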

In [0]:
# Your Code Here

## 5) Create at least one visualization from a crosstab:

Remember that a crosstab is just a dataframe and can be manipulated in the same way: by row index, column index, or row/column/cell position.
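Because the crosstab is a dataframe, it can be sliced and plotted like any other. The sketch below uses made-up rows with column names borrowed from the student dataset, and a stacked bar chart is just one reasonable choice of visualization.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS", "GP"],
    "internet": ["yes", "no", "yes", "yes", "no", "yes"],
})
ct = pd.crosstab(toy["school"], toy["internet"])

yes_only = ct["yes"]        # select a column by label
gp_row = ct.loc["GP"]       # select a row by index label
first_cell = ct.iloc[0, 0]  # select a cell by position

# One bar per school, split by internet access.
ax = ct.plot(kind="bar", stacked=True, title="Internet access by school")
ax.set_ylabel("Number of students")
plt.tight_layout()
```

Passing `normalize="index"` to `pd.crosstab` before plotting would turn the bars into proportions instead of counts.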


In [0]:
# Your Code Here

## Stretch Goals 

The following additional study tasks are optional. They are intended to give you an opportunity to stretch yourself beyond the main requirements of the assignment. You can pick and choose from the tasks below; you do not need to complete them in any particular order.

### - Practice Exploring other Datasets

### - Try using the Seaborn plotting library's "Pairplot" functionality to explore all of the possible histograms and scatterplots of your dataset at once:

[Seaborn Pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html)

### - Turn some of the continuous variables into categorical variables by binning the values using:
- [pd.cut()](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html)
- [pd.qcut()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html)
- <https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut>

And then use crosstabs to compare/visualize these binned variables against the other variables.
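A small sketch of the difference between the two functions: `pd.cut` bins by value ranges you choose, while `pd.qcut` picks the edges so each bin holds roughly the same number of rows. The data, bin edges, and labels below are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "absences": [0, 2, 4, 6, 10, 14, 20, 30],
    "sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
})

# Fixed, hand-picked edges: (-1, 5], (5, 15], (15, 100].
toy["absence_band"] = pd.cut(
    toy["absences"], bins=[-1, 5, 15, 100], labels=["low", "medium", "high"]
)

# Quantile-based edges: four bins with (roughly) equal counts.
toy["absence_quartile"] = pd.qcut(
    toy["absences"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

print(pd.crosstab(toy["absence_band"], toy["sex"]))
print(toy["absence_quartile"].value_counts().sort_index())
```

With heavily repeated values, `pd.qcut` can fail because bin edges coincide; its `duplicates="drop"` option is one way around that.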


### - Other types and sources of data
Not all data comes in a nice single file - for example, image classification involves handling lots of image files. You still will probably want labels for them, so you may have tabular data in addition to the image blobs - and the images may be reduced in resolution and even fit in a regular csv as a bunch of numbers.

If you're interested in natural language processing and analyzing text, that is another example where, while it can be put in a csv, you may end up loading much larger raw data and generating features that can then be thought of in a more standard tabular fashion.

Overall you will in the course of learning data science deal with loading data in a variety of ways. Another common way to get data is from a database - most modern applications are backed by one or more databases, which you can query to get data to analyze. We'll cover this more in our data engineering unit.
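To get a feel for querying a database without any setup, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The table name, column names, and query are assumptions for illustration; the three rows are taken from the CPU data shown earlier.

```python
import sqlite3

import pandas as pd

# An in-memory database disappears when the connection is closed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machines (vendor TEXT, model TEXT, published_rp INTEGER)")
conn.executemany(
    "INSERT INTO machines VALUES (?, ?, ?)",
    [("adviser", "32/60", 198), ("amdahl", "470v/7", 269), ("amdahl", "470v/7a", 220)],
)
conn.commit()

# Let the database do the aggregation, then load the result as a dataframe.
query = """
    SELECT vendor, COUNT(*) AS n_models, AVG(published_rp) AS mean_rp
    FROM machines
    GROUP BY vendor
    ORDER BY vendor
"""
df_db = pd.read_sql_query(query, conn)
conn.close()
print(df_db)
```

A production database would be reached through a different connection object, but the pattern of "write SQL, get a dataframe back" stays much the same.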

How does data get in the database? Most applications generate logs - text files with lots and lots of records of each use of the application. Databases are often populated based on these files, but in some situations you may directly analyze log files. The usual way to do this is with command line (Unix) tools - command lines are intimidating, so don't expect to learn them all at once, but depending on your interests it can be useful to practice.
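Log files can also be parsed directly in Python. The sketch below assumes a made-up log format (date, time, level, then `key=value` pairs); real application logs vary widely, so the regular expression would need adapting.

```python
import re
from collections import Counter

# Hypothetical log lines in a made-up format.
log_text = """\
2024-01-05 10:00:01 INFO user=alice action=login
2024-01-05 10:00:07 ERROR user=bob action=upload
2024-01-05 10:01:12 INFO user=alice action=logout
"""

LINE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>\w+) "
    r"user=(?P<user>\w+) action=(?P<action>\w+)$"
)

# Keep only the lines that match the expected format, as dictionaries.
records = [m.groupdict() for line in log_text.splitlines() if (m := LINE.match(line))]
level_counts = Counter(r["level"] for r in records)

print(records[0])
print(level_counts)
```

The list of dictionaries can be handed to `pd.DataFrame(records)` if you want to continue the analysis in pandas.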

One last major source of data is APIs: https://github.com/toddmotto/public-apis

API stands for Application Programming Interface. The term originally referred to, for example, the way an application interfaced with the GUI or other parts of an operating system, but now it largely refers to online services that let you query and retrieve data. You can essentially think of most of them as "somebody else's database" to which you have (usually limited) access.
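Many web APIs return JSON, which often nests records inside records. The sketch below parses a made-up JSON payload (standing in for something like `requests.get(url).text`) and flattens it into a table with `pd.json_normalize`. The field names are hypothetical and do not come from any particular API.

```python
import json

import pandas as pd

# Made-up payload standing in for the body of an API response.
payload = """
{"results": [
    {"id": 1, "name": "alpha", "stats": {"score": 0.5}},
    {"id": 2, "name": "beta",  "stats": {"score": 0.9}}
]}
"""

data = json.loads(payload)

# Nested keys become dotted column names such as "stats.score".
df_api = pd.json_normalize(data["results"])
print(df_api)
```

Real APIs usually add concerns this sketch ignores, such as authentication, rate limits, and paging through results.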

*Stretch goal* - research one of the above extended forms of data/data loading. See if you can get a basic example working in a notebook. Image, text, or (public) APIs are probably more tractable - databases are interesting, but there aren't many publicly accessible and they require a great deal of setup.