# Lecture 7: Obtaining Data from a File
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 6](https://wesmckinney.com/book/accessing-data)

Class notes are found through GitHub. As changes are made, they will automatically be uploaded to GitHub. A link to the repository is on Canvas.

-----
## Outline
* File Requirements
  * Description of Variables (in file or as a separate document)
* Loading Data from a File
  * Column Names
  * Headers
  * Showing just the head or tail of the data
  * Splitting Variables

In [None]:
import pandas as pd

-----
## Loading Data from File

Data must be stored in a supported file format before it can be loaded into Python. Python can read several formats on its own, but Pandas can load more.
* [File types handled by Pandas - McKinney, Chapter 6](https://wesmckinney.com/book/accessing-data#tbl-table_parsing_functions)

Regardless of the file type, we must know what we are dealing with. Consider this dataset:

In [None]:
import pandas as pd
grades = pd.read_csv("../Datasets/grades.csv")
grades

Uh-oh! Something is wrong. We can see the data, and from the name of the file, it looks like we have some sort of list of grades. We have the names of the students, but we have no idea what each variable in our DataFrame is.

Let's look now at a few requirements for good datasets.

-----
## Good Rules for Handling Data

First, take a look at the data. It is always good to see what the data looks like before we try to do anything with it.
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades.csv)

1. Look at your data before loading it
2. When you are saving data, make sure there is documentation
    * Documentation includes an explanation of what the dataset is, what the columns of the dataset are, and units for each variable.

This file has data, but the columns have no labels. When we tried to import it earlier, the first line was treated as a header line and became the column labels. To avoid that, we tell Pandas that there is no header.

In [None]:
grades = pd.read_csv('../Datasets/grades.csv', header=None)
grades.head(10)
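
To see the effect of `header=None` in isolation, here is a minimal sketch that reads a tiny CSV held in memory instead of a file. The names and scores are invented for illustration and are not taken from `grades.csv`.

```python
import io
import pandas as pd

# Hypothetical two-line CSV with no header line (invented data)
raw = "Alice,90,85\nBob,70,95\n"

# Default behavior: the first line is consumed as column labels
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is data, and columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(list(with_default.columns))  # Alice's row became the labels
print(with_default.shape)          # only Bob's row is left as data
print(list(no_header.columns))     # integer column labels
print(no_header.shape)             # both rows are kept
```

With the default settings, one student silently disappears from the data, which is why it pays to look at the raw file first.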

However, this is as far as we can go. Without an explanation of the variables, our dataset is basically useless, because we can't decide what to do with it.

Every dataset MUST come with some explanation of its variables. Very often, this is done with a separate README file. For this dataset, look at the documentation here:
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades_README.txt](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades_README.txt)

Using this, we can update our DataFrame.

In [None]:
grades = pd.read_csv('../Datasets/grades.csv', header=None)
grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']

# See the first few rows of the data (default=5)
grades.head()   
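
As an alternative to assigning `grades.columns` afterward, `pd.read_csv()` also accepts a `names=` argument that sets the labels while loading. The sketch below uses invented in-memory data and hypothetical labels, and it also shows `.tail()`, the counterpart of `.head()`.

```python
import io
import pandas as pd

# Hypothetical header-less data (invented for illustration)
raw = "Alice,90,85\nBob,70,95\nCara,88,91\n"
labels = ['Name', 'Homework', 'Midterm']  # hypothetical column labels

# names= supplies the labels at load time; header=None keeps line 1 as data
demo = pd.read_csv(io.StringIO(raw), header=None, names=labels)

print(demo.head(2))  # first two rows
print(demo.tail(1))  # last row only
```

Either approach gives the same DataFrame; `names=` simply keeps the loading and the labeling in one step.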

Sometimes the variables are labeled in the data file itself, as in the following example, though the labels may be abbreviated or coded. Again, the README file should explain what each one means.
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades1.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades1.csv)

In [None]:
grades = pd.read_csv('../Datasets/grades1.csv') # No header argument: the first line (header=0) becomes the column labels
display(grades.head(4))

grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']
display(grades.head(4))

Sometimes, the file is separated by a character other than a comma. Some common separators (a.k.a. delimiters) include:
* `;` a semicolon
* `:` a colon
* ` ` a space
* `\t` a tab

Here's the same dataset, only this time separated by a semicolon (;).
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades2.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades2.csv)

In [None]:
# The default separator is a comma, so the semicolon must be given explicitly
grades = pd.read_csv('../Datasets/grades2.csv', sep=';')
grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']

display(grades.head(4))
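
The following sketch shows what goes wrong when the separator is left at its default, using a small invented dataset held in memory (not the contents of `grades2.csv`).

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data (invented for illustration)
semi = "Name;Homework;Midterm\nAlice;90;85\nBob;70;95\n"

# Default sep=',': no commas are found, so each line lands in ONE column
wrong = pd.read_csv(io.StringIO(semi))

# sep=';' splits each line into the three intended columns
right = pd.read_csv(io.StringIO(semi), sep=';')

print(wrong.shape)  # a single wide column
print(right.shape)  # three columns

# The same idea works for other delimiters, such as a tab
tabbed = semi.replace(';', '\t')
from_tabs = pd.read_csv(io.StringIO(tabbed), sep='\t')
print(from_tabs.equals(right))
```

A DataFrame with only one column after loading is a common sign that the wrong separator was used.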

Finally, there is one other way documentation might be provided. We might see the labels *in the file itself*. If we are following good practice, we will open the file before importing the data and see these labels.
* [https://github.com/drolsonmi/math3080/blob/main/Datasets/grades3.csv](https://github.com/drolsonmi/math3080/blob/main/Datasets/grades3.csv)

In this case, there are extra lines ahead of the data which explain the data itself. But this makes loading the data more difficult. For this, use the `skiprows=` argument.

In [None]:
grades = pd.read_csv('../Datasets/grades3.csv', skiprows=9)
display(grades.head(4))

grades.columns = ['Name','Attendance','Homework','Project_Proposal',
                   'Project_Checkup','Project_Final','Midterm','Final']
display(grades.head(4))
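
As a small illustration of `skiprows=`, the sketch below reads an invented in-memory file whose first three lines are documentation. The layout is hypothetical and does not reproduce `grades3.csv`.

```python
import io
import pandas as pd

# Hypothetical file: three lines of documentation, then a header and data
raw = (
    "Grades for a made-up class\n"
    "Column 1: student name\n"
    "Column 2: homework score\n"
    "Name,Homework\n"
    "Alice,90\n"
    "Bob,70\n"
)

# Skip the three documentation lines; the next line is read as the header
demo = pd.read_csv(io.StringIO(raw), skiprows=3)

print(demo)
```

The number passed to `skiprows=` has to match the number of extra lines in the file, which is another reason to open the file before loading it.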

-----
## Loading data from an Excel file

We could simply read an Excel file just like any other file type.

In [None]:
grades = pd.read_excel('../grades3.xlsx', skiprows=10)
grades

However, Pandas offers some more functionality. If we use `pd.ExcelFile()` to load the file, then we get an object with all worksheets within the file. We can then choose which one we want to work with.

In [None]:
excel = pd.ExcelFile('grades3.xlsx')
excel

In [None]:
excel.sheet_names

In [None]:
grades = excel.parse(sheet_name='grades3', skiprows=10)
grades

-----
## Loading data from a file on the Internet

Loading a file from the internet is just like reading it from a local file. Just pass a URL instead of a file path.

In [None]:
salaries = pd.read_csv('https://raw.githubusercontent.com/drolsonmi/math3080/main/Datasets/data_science_salaries.csv')
salaries.head(4)

-----
## Web Scraping
Web scraping is a very important tool and technique. A lot of data on the internet is set up as a table on an HTML page.

If you are like me, you have handled HTML tables by going to the webpage, copying the table, pasting it into an Excel file, and then working for an hour or two to get the data into a format you can use. This is obnoxious and takes way too much time.

The idea of __web scraping__ is to go through the file itself and identify any tables in the file. Most commonly, we apply web scraping to *HTML* and *XML* files.

### Web Scraping with HTML
In Pandas, we have a `pd.read_html()` function. This will take the given HTML file and look for `<table>` tags. If it finds one, then it can decode the table and save it as a DataFrame.
* To read a table from an HTML file, install the `lxml` package

What if there are multiple tables on the webpage? The `pd.read_html()` command will find all `<table>` tags and convert each one into a DataFrame, collecting them in a list. So, the output of the `pd.read_html()` command is always a list, even when only one table is found. To access the table you want, just index into the list.

Look at the [author's FDIC example](https://raw.githubusercontent.com/wesm/pydata-book/3rd-edition/examples/fdic_failed_bank_list.html)

In [None]:
banks = pd.read_html('https://raw.githubusercontent.com/wesm/pydata-book/3rd-edition/examples/fdic_failed_bank_list.html')
display(banks[0])
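
To illustrate the multiple-table case without relying on a live webpage, here is a minimal sketch that parses a small HTML string held in memory. The two tables and their contents are invented for illustration, and the example assumes the `lxml` package is installed.

```python
import io
import pandas as pd

# Hypothetical page containing two tables (invented data)
html = """
<html><body>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
  <tr><td>Bob</td><td>70</td></tr>
</table>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>1000</td></tr>
</table>
</body></html>
"""

# read_html returns one DataFrame per <table> tag, in page order
tables = pd.read_html(io.StringIO(html))
print(len(tables))

scores = tables[0]  # first table on the page
cities = tables[1]  # second table on the page
print(scores)
print(cities)
```

Wrapping the string in `io.StringIO` makes it behave like a file, which is the form recent versions of Pandas expect for literal HTML.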

### Web Scraping with XML