<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Pandas: Importing Data
              
</p>
</div>

Data Science Cohort Live NYC Nov 2023
<p>Phase 1</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
   

Pandas can import data from, and export data to, a variety of file formats and sources. In this notebook we will:
    
- Import data from CSV, JSON, and Excel files
- Export a DataFrame to a file


In [None]:
# import our libraries
import numpy as np
import pandas as pd

Pandas functions for importing files into (or creating) DataFrames:
* `pd.read_csv()`
* `pd.read_excel()`
* `pd.read_json()`
* `pd.DataFrame.from_dict()`
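
Of the four, `pd.DataFrame.from_dict()` is the only one that builds a DataFrame from an in-memory object rather than from a file. Here is a minimal sketch; the blood-pressure readings and column names are invented for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical readings, invented for illustration only
readings = {
    "patient": ["A", "B", "C"],
    "systolic": [120, 135, 110],
}

# Default orient='columns': each dict key becomes a column
by_column = pd.DataFrame.from_dict(readings)

# orient='index': each dict key becomes a row label instead,
# and the column names are supplied separately
by_row = pd.DataFrame.from_dict(
    {"A": [120, 80], "B": [135, 90]},
    orient="index",
    columns=["systolic", "diastolic"],
)

print(by_column)
print(by_row)
```

Which `orient` to use depends on whether the dictionary keys describe columns or rows.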

CSV import:
- Example: patient blood pressure data in the file `'bp.txt'`, stored in the `Data` folder.

In [None]:
# Import the 'bp.txt' file, whose entries are tab-separated.
# If delimiter is not specified, read_csv assumes ',' as the delimiter.
df = pd.read_csv('Data/bp.txt', delimiter='\t')
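
To see why the delimiter matters, here is a small sketch that reads the same tab-separated text with and without it. The text is a made-up stand-in for a file on disk (it is not the contents of `bp.txt`), and `io.StringIO` is used only so the example runs without any data files.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk
raw = "patient\tsystolic\tdiastolic\nA\t120\t80\nB\t135\t90\n"

# Without a delimiter, pandas splits on commas and finds none,
# so each line ends up in a single column
wrong = pd.read_csv(io.StringIO(raw))

# With delimiter='\t' (sep='\t' is equivalent) the three columns are recovered
right = pd.read_csv(io.StringIO(raw), delimiter="\t")

print(wrong.shape)
print(right.shape)
```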

In [None]:
# Look at the first 3 rows
df.head(3)

In [None]:
# Look at the last 4 rows
df.tail(4)


#### Skipping and Limiting Rows

- Skipping rows: the optional `skiprows` parameter ignores rows, such as metadata stored at the top of a file.
- Limiting the number of rows loaded: the `nrows` parameter.
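
A minimal sketch of both parameters on a tiny in-memory file. The contents are invented for illustration, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical file contents: two metadata lines, then a header and data
raw = (
    "Exported by clinic system\n"
    "Units: mmHg\n"
    "patient,systolic\n"
    "A,120\n"
    "B,135\n"
    "C,110\n"
)

# skiprows=2 skips the first two lines of the file;
# nrows=2 then loads only the first two data rows after the header
df_demo = pd.read_csv(io.StringIO(raw), skiprows=2, nrows=2)

# A list skips specific 0-indexed lines instead:
# here the two metadata lines plus line 3 ("A,120")
df_list = pd.read_csv(io.StringIO(raw), skiprows=[0, 1, 3])

print(df_demo)
print(df_list)
```

Note the difference between an integer (skip that many lines from the top) and a list (skip exactly those line numbers).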

Let's look at another dataset:

In [None]:
# Import the first 100 rows of 'ACS_16_5YR_B24011_with_ann.csv' file
df = pd.read_csv('Data/ACS_16_5YR_B24011_with_ann.csv', nrows=100)

# Look at the first five rows
df.head()

#### Notice that the first row contains descriptions of the variables

We could just drop the first row:

In [None]:
# Delete the first row
df = df.drop(0)
df.head(2)

Or, if we knew this from the start, we could use the `skiprows` argument:

In [None]:
# Import the first 100 rows of 'ACS_16_5YR_B24011_with_ann.csv' while skipping the description row
# (skiprows=[1] skips file line 1, counting from 0, i.e. the line right after the header)
df = pd.read_csv('Data/ACS_16_5YR_B24011_with_ann.csv', skiprows=[1], nrows=100)
df.head()

#### Header

Related to `skiprows` is the `header` parameter:
- `header` is the row number that holds the column names; data import starts from the row after it.
- Just for fun, let's set the data description row as our header.
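
The effect of `header` is easier to see on a small example. This sketch uses an invented two-header-line file (not the ACS data) read from `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical file: line 0 holds short codes, line 1 holds descriptions
raw = (
    "id,val\n"
    "Identifier,Measured value\n"
    "1,10\n"
    "2,20\n"
)

# Default header=0: codes become column names, descriptions become a data row
default = pd.read_csv(io.StringIO(raw))

# header=1: descriptions become column names and line 0 is discarded
described = pd.read_csv(io.StringIO(raw), header=1)

print(default)
print(described)
```

With `header=1` the numeric columns are also parsed as numbers, because the text description row no longer sits among the data.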

In [None]:
# Look at the error output once you run this cell. What type of error is it?
df = pd.read_csv('Data/ACS_16_5YR_B24011_with_ann.csv', header=1)
df.head()

#### Encoding

Encoding errors:
- Strings within a file are stored as bytes according to an encoding scheme, and reading them with the wrong scheme can fail.
- Most common encoding: `utf-8` (the pandas default)
- Another common encoding: `latin-1`
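
The following sketch reproduces this kind of failure on a tiny scale. The bytes are invented for illustration: a CSV encoded as `latin-1` and held in memory with `io.BytesIO` instead of a file.

```python
import io
import pandas as pd

# Hypothetical CSV bytes encoded as latin-1; "é" is stored as the single
# byte 0xE9, which is not valid UTF-8 in this position
raw_bytes = "city,count\nMontréal,3\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(raw_bytes))  # default encoding is utf-8
    error_name = None
except UnicodeDecodeError as err:
    error_name = type(err).__name__

# Declaring the actual encoding decodes the text correctly
df_latin = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(error_name)
print(df_latin)
```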


In [None]:
# Import the 'ACS_16_5YR_B24011_with_ann.csv' file using a proper encoding
df = pd.read_csv('Data/ACS_16_5YR_B24011_with_ann.csv', header=1, encoding='latin-1')
df.head()

#### Selecting Specific Columns  

You can also select specific columns if you only want to load specific features.

In [None]:
# Import the file with specific columns, selected by position
df = pd.read_csv('Data/ACS_16_5YR_B24011_with_ann.csv', 
                 usecols=[0, 1, 2, 5, 6], skiprows=[1], encoding='latin-1')
df.head(2)

**or**

In [None]:
# Import the file with specific columns, selected by name
df = pd.read_csv('Data/ACS_16_5YR_B24011_with_ann.csv', 
                 skiprows=[1], 
                 usecols=['GEO.id', 'GEO.id2', 'HD01_VD02', 'HD02_VD02'], 
                 encoding='latin-1')
df.head(2)
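
As a quick check that positions and names are interchangeable in `usecols`, here is a sketch on an invented four-column file held in memory (the column names are hypothetical, not from the ACS data).

```python
import io
import pandas as pd

# Hypothetical four-column file
raw = "id,name,score,notes\n1,Ann,90,ok\n2,Bob,85,late\n"

# Select columns by 0-indexed position...
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 2])

# ...or by name; both calls load the same two columns
by_name = pd.read_csv(io.StringIO(raw), usecols=["id", "score"])

print(by_position.equals(by_name))
```

Names are usually the safer choice, since they keep working if the column order in the file changes.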

#### Importing Excel files
Importing Excel files is pretty similar to importing CSV files.

Load in an Excel file containing some Yelp reviews.

In [None]:
# Import an Excel file
df1 = pd.read_excel('Data/Yelp_Selected_Businesses.xlsx', header=2)
df1.head()

But wait a minute... Excel files often have multiple sheets:

- By default, `pd.read_excel()` imports only the first sheet.
- Other sheets can be selected with the `sheet_name` parameter.

If we know which sheet we want, we can import it by sheet number (counting from 0):

In [None]:
# Import a specific sheet of an Excel file by position (sheet_name=2 is the third sheet)
df2 = pd.read_excel('Data/Yelp_Selected_Businesses.xlsx', sheet_name=2, header=2)
df2.head()

Or by sheet name:

In [None]:
# Import a specific sheet of an Excel file
df = pd.read_excel('Data/Yelp_Selected_Businesses.xlsx', sheet_name='Biz_id_RESDU', header=2)
df.head()

#### Loading a Full Workbook and Previewing Sheet Names
Load an entire Excel workbook (a collection of spreadsheets) with the `pd.ExcelFile` class.

In [None]:
# Load the workbook and list the names of its sheets
workbook = pd.ExcelFile('Data/Yelp_Selected_Businesses.xlsx')
workbook.sheet_names 

In [None]:
# Import a specific sheet
df = workbook.parse(sheet_name=1, header=2)
df.head()

In [None]:
df = workbook.parse(sheet_name='Biz_id_ujHia', header=2)
df.head()

#### Saving Data
Once we have loaded data that we may want to export back out, we use the **`.to_csv()`** or **`.to_excel()`** method of any DataFrame object.

In [None]:
# Write data to a CSV file 
# Notice how we have to pass index=False if we do not want it included in our output
df.to_csv('NewSavedView.csv', index=False) 

In [None]:
# Write data to an Excel file
# The index is written as well, because index=False is not passed here
df.to_excel('NewSavedView.xlsx')
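
To see what `index=False` changes, this sketch writes a small DataFrame to CSV both ways and reads each file back. The DataFrame and file names are invented for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile
import pandas as pd

# Hypothetical DataFrame standing in for the data loaded above
df_out = pd.DataFrame({"patient": ["A", "B"], "systolic": [120, 135]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    df_out.to_csv(with_index)                  # index written as first column
    df_out.to_csv(without_index, index=False)  # index left out

    reloaded_with = pd.read_csv(with_index)
    reloaded_without = pd.read_csv(without_index)

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

The file written with the index comes back with an extra unnamed column, which is why `index=False` is usually what we want when the index carries no information.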

#### Summary

Importing other file formats (JSON, etc.) into DataFrames is pretty similar. See the pandas documentation for the rest.
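
As one example, here is a minimal sketch of a JSON round trip with `pd.read_json()` and `.to_json()`. The records are invented for illustration, and the JSON text is wrapped in `io.StringIO` so it can be read like a file.

```python
import io
import pandas as pd

# Hypothetical JSON records
raw_json = '[{"patient": "A", "systolic": 120}, {"patient": "B", "systolic": 135}]'

# Wrapping the string in StringIO lets read_json treat it like a file
df_json = pd.read_json(io.StringIO(raw_json))

# to_json with orient='records' writes the same list-of-objects layout back out
round_trip = df_json.to_json(orient="records")

print(df_json)
print(round_trip)
```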