# Importing Data Using Pandas

## Introduction

Pandas is a popular library for efficiently wrangling data. Pandas is particularly optimized to work with two-dimensional tabular data that is organized in rows and columns. In this lesson, you will learn how to import tabular data as a Pandas DataFrame object, how to access and manipulate the data in DataFrame objects, and how to export DataFrames to some common file formats. 

For more information on Pandas, refer to the documentation at https://pandas.pydata.org/pandas-docs/stable/.

## Objectives
You will be able to:
- Import data from CSV files and Excel files
- Understand and explain key arguments for imports
- Save information to CSV and Excel files
- Access data within a Pandas DataFrame (`print()` and `.head()`)

## Loading Pandas
![pleasedont](https://media.giphy.com/media/5zjdLoaVHKHuhQdtRF/giphy.gif)

When importing Pandas, it is standard to import it under the alias `pd`:

In [1]:
import pandas as pd  # pd is an alias

## Importing Data

There are a few main functions for importing data into a Pandas DataFrame including:

* `pd.read_csv()`
* `pd.read_excel()`
* `pd.read_json()`
* `pd.DataFrame.from_dict()`

Most of these functions are fairly straightforward: you use `read_csv()` for CSV files, `read_excel()` for Excel files (both the older `.xls` and the newer `.xlsx` formats), and `read_json()` for JSON files. That said, there are a few nuances you should know about. The first is that the `read_csv()` function can be used for any plain-text delimited file. This may include (but is not limited to) pipe (|) delimited files (`.psv`) and tab-separated files (`.tsv`).
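As a quick illustration of the delimiter point, here is a minimal sketch that reads pipe-delimited text. The data is made up and held in memory with `io.StringIO` instead of a real `.psv` file, so the variable names are assumptions for this example only.

```python
import io

import pandas as pd

# Made-up pipe-delimited text standing in for a .psv file on disk
psv_text = "Pt|BP|Age\n1|105|47\n2|115|49\n"

# sep (or its alias, delimiter) tells read_csv which character separates fields
psv_df = pd.read_csv(io.StringIO(psv_text), sep="|")
print(psv_df)
```

The same call works with a file path in place of the `StringIO` object.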

Let's look at an example by investigating a file, `'bp.txt'`, stored in the `data` folder.

In [2]:
data = pd.read_csv("../data/bp.txt")
data.head()

Unnamed: 0,Pt\tBP\tAge\tWeight\tBSA\tDur\tPulse\tStress
0,1\t105\t47\t85.4\t1.75\t5.1\t63\t33
1,2\t115\t49\t94.2\t2.10\t3.8\t70\t14
2,3\t116\t49\t95.3\t1.98\t8.2\t72\t10
3,4\t117\t50\t94.7\t2.01\t5.8\t73\t99
4,5\t112\t51\t89.4\t1.89\t7.0\t72\t95


In [3]:
data = pd.read_csv("../data/bp.txt", delimiter="\t")
data.head()

Unnamed: 0,Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress
0,1,105,47,85.4,1.75,5.1,63,33
1,2,115,49,94.2,2.1,3.8,70,14
2,3,116,49,95.3,1.98,8.2,72,10
3,4,117,50,94.7,2.01,5.8,73,99
4,5,112,51,89.4,1.89,7.0,72,95


+ We've now loaded the data from a file into a DataFrame. To investigate the DataFrame, we can use the methods `.head(n)` and `.tail(n)`, which return the first and last __n__ rows of the DataFrame, respectively.

In [6]:
data.head(3)

Unnamed: 0,Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress
0,1,105,47,85.4,1.75,5.1,63,33
1,2,115,49,94.2,2.1,3.8,70,14
2,3,116,49,95.3,1.98,8.2,72,10


In [7]:
data.tail(2)

Unnamed: 0,Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress
18,19,110,48,90.5,1.88,9.0,71,99
19,20,122,56,95.7,2.09,7.0,75,99


This example shows that the data was tab delimited (`\t`), so an appropriate file extension could have also been `.tsv`. Once we've loaded the dataset, we can export it to any format we would like with the related methods:

* `df.to_csv()`
* `df.to_excel()`
* `df.to_json()`
* `df.to_dict()`

There are also several other options available, but these are the most common.
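To make the pairing between export and import functions concrete, here is a small sketch of a round trip. It uses a made-up two-column DataFrame and keeps everything in memory, so no files are written; the variable names are only for illustration.

```python
import io

import pandas as pd

# Small made-up DataFrame for illustration
df_small = pd.DataFrame({"Pt": [1, 2], "BP": [105, 115]})

# With no path argument, to_csv() and to_json() return the text as a string
csv_text = df_small.to_csv(index=False)
json_text = df_small.to_json(orient="records")

# to_dict() returns a plain Python dictionary
as_dict = df_small.to_dict(orient="list")

# Each export can be read back with the matching import function
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))
from_dict = pd.DataFrame.from_dict(as_dict)
```

`to_excel()` follows the same pattern but needs a file path and an Excel engine installed, so it is left out of this sketch.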

## Skipping and Limiting Rows

Another feature that you may have to employ is skipping rows when there is metadata stored at the top of a file. You can do this using the optional parameter `skiprows`. Similarly, if you want to only load a portion of a large file as an initial preview, you can use the `nrows` parameter.

In [9]:
df = pd.read_csv("../data/ACS_16_5YR_B24011_with_ann.csv", nrows=100)
df.head()

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,HD01_VD01,HD02_VD01,HD01_VD02,HD02_VD02,HD01_VD03,HD02_VD03,HD01_VD04,...,HD01_VD32,HD02_VD32,HD01_VD33,HD02_VD33,HD01_VD34,HD02_VD34,HD01_VD35,HD02_VD35,HD01_VD36,HD02_VD36
0,Id,Id2,Geography,Estimate; Total:,Margin of Error; Total:,"Estimate; Total: - Management, business, scien...","Margin of Error; Total: - Management, business...","Estimate; Total: - Management, business, scien...","Margin of Error; Total: - Management, business...","Estimate; Total: - Management, business, scien...",...,"Estimate; Total: - Natural resources, construc...","Margin of Error; Total: - Natural resources, c...","Estimate; Total: - Production, transportation,...","Margin of Error; Total: - Production, transpor...","Estimate; Total: - Production, transportation,...","Margin of Error; Total: - Production, transpor...","Estimate; Total: - Production, transportation,...","Margin of Error; Total: - Production, transpor...","Estimate; Total: - Production, transportation,...","Margin of Error; Total: - Production, transpor..."
1,0500000US01001,01001,"Autauga County, Alabama",33267,2306,48819,1806,55557,4972,63333,...,31402,5135,35594,3034,36059,3893,47266,13608,19076,4808
2,0500000US01003,01003,"Baldwin County, Alabama",31540,683,49524,1811,57150,6980,63422,...,35603,3882,30549,1606,29604,4554,35504,6260,24182,3580
3,0500000US01005,01005,"Barbour County, Alabama",26575,1653,41652,2638,51797,5980,52775,...,37847,11189,26094,4884,25339,4900,37282,6017,16607,3497
4,0500000US01007,01007,"Bibb County, Alabama",30088,2224,40787,2896,50069,12841,67917,...,45952,5622,28983,3401,31881,2317,26580,2901,23479,4942


### Notice that the first row contains descriptions of the variables

+ We can drop it manually.

In [10]:
df = df.drop(0)
df.head(3)

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,HD01_VD01,HD02_VD01,HD01_VD02,HD02_VD02,HD01_VD03,HD02_VD03,HD01_VD04,...,HD01_VD32,HD02_VD32,HD01_VD33,HD02_VD33,HD01_VD34,HD02_VD34,HD01_VD35,HD02_VD35,HD01_VD36,HD02_VD36
1,0500000US01001,1001,"Autauga County, Alabama",33267,2306,48819,1806,55557,4972,63333,...,31402,5135,35594,3034,36059,3893,47266,13608,19076,4808
2,0500000US01003,1003,"Baldwin County, Alabama",31540,683,49524,1811,57150,6980,63422,...,35603,3882,30549,1606,29604,4554,35504,6260,24182,3580
3,0500000US01005,1005,"Barbour County, Alabama",26575,1653,41652,2638,51797,5980,52775,...,37847,11189,26094,4884,25339,4900,37282,6017,16607,3497
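As an alternative to dropping the row after loading, `skiprows` can skip it while reading. The sketch below assumes a tiny made-up CSV that mimics the layout of the census file (a header line followed by a description line); it is not the real data.

```python
import io

import pandas as pd

# Made-up CSV whose second line describes the variables, like the census file
csv_text = (
    "GEO.id2,HD01_VD01\n"
    "Id2,Estimate; Total:\n"
    "01001,33267\n"
    "01003,31540\n"
)

# skiprows=[1] skips only the line at position 1 (0-based), keeping the header
df_skip = pd.read_csv(io.StringIO(csv_text), skiprows=[1])

# nrows limits how many data rows are read
df_preview = pd.read_csv(io.StringIO(csv_text), skiprows=[1], nrows=1)
```

Because the description row never reaches the DataFrame, the remaining columns can be parsed as numbers instead of text.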


## Header

Related to `skiprows` is the `header` parameter. This specifies the (0-indexed) row that holds the column names; data is imported starting from the row after it:

In [5]:
# look at the error output once you run this cell. What type of error is it?
df = pd.read_csv('../data/ACS_16_5YR_B24011_with_ann.csv', header=1)
df.head()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 2: invalid continuation byte

## Encoding

Encoding errors like the one above can be frustrating. They occur when the bytes in the file were written with a different character encoding than the one used to read them; `read_csv()` assumes `utf-8` by default. The most common encoding other than `utf-8` that you are likely to come across is `latin-1`.

In [12]:
df = pd.read_csv("../data/ACS_16_5YR_B24011_with_ann.csv", header=1, encoding="latin-1")
df.head()

Unnamed: 0,Id,Id2,Geography,Estimate; Total:,Margin of Error; Total:,"Estimate; Total: - Management, business, science, and arts occupations:","Margin of Error; Total: - Management, business, science, and arts occupations:","Estimate; Total: - Management, business, science, and arts occupations: - Management, business, and financial occupations:","Margin of Error; Total: - Management, business, science, and arts occupations: - Management, business, and financial occupations:","Estimate; Total: - Management, business, science, and arts occupations: - Management, business, and financial occupations: - Management occupations",...,"Estimate; Total: - Natural resources, construction, and maintenance occupations: - Installation, maintenance, and repair occupations","Margin of Error; Total: - Natural resources, construction, and maintenance occupations: - Installation, maintenance, and repair occupations","Estimate; Total: - Production, transportation, and material moving occupations:","Margin of Error; Total: - Production, transportation, and material moving occupations:","Estimate; Total: - Production, transportation, and material moving occupations: - Production occupations","Margin of Error; Total: - Production, transportation, and material moving occupations: - Production occupations","Estimate; Total: - Production, transportation, and material moving occupations: - Transportation occupations","Margin of Error; Total: - Production, transportation, and material moving occupations: - Transportation occupations","Estimate; Total: - Production, transportation, and material moving occupations: - Material moving occupations","Margin of Error; Total: - Production, transportation, and material moving occupations: - Material moving occupations"
0,0500000US01001,1001,"Autauga County, Alabama",33267,2306,48819,1806,55557,4972,63333,...,31402,5135,35594,3034,36059,3893,47266,13608,19076,4808
1,0500000US01003,1003,"Baldwin County, Alabama",31540,683,49524,1811,57150,6980,63422,...,35603,3882,30549,1606,29604,4554,35504,6260,24182,3580
2,0500000US01005,1005,"Barbour County, Alabama",26575,1653,41652,2638,51797,5980,52775,...,37847,11189,26094,4884,25339,4900,37282,6017,16607,3497
3,0500000US01007,1007,"Bibb County, Alabama",30088,2224,40787,2896,50069,12841,67917,...,45952,5622,28983,3401,31881,2317,26580,2901,23479,4942
4,0500000US01009,1009,"Blount County, Alabama",34900,2063,46593,2963,47003,6189,50991,...,42489,7176,32969,3767,31814,4551,41375,5280,26755,2963
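The encoding error can be reproduced on a small scale. The sketch below builds a made-up one-column CSV in memory, encodes it as `latin-1`, and reads it twice; the byte `0xf1` is how `latin-1` stores the letter "ñ".

```python
import io

import pandas as pd

# Made-up CSV bytes encoded as latin-1; "ñ" becomes the single byte 0xf1
raw_bytes = "county\nDoña Ana County\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(raw_bytes))  # default encoding is utf-8
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Naming the encoding the file was actually written in resolves the error
df_latin = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
```

If you do not know a file's encoding, trying `latin-1` is a reasonable first guess, but it is worth checking the loaded text for garbled characters afterwards.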


## Selecting Specific Columns  

+ You can also load only specific columns by passing their positions to the `usecols` parameter.

In [13]:
df = pd.read_csv('../data/ACS_16_5YR_B24011_with_ann.csv', usecols=[0,1,2,5,6], encoding='latin-1')
df.head(2)

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,HD01_VD02,HD02_VD02
0,Id,Id2,Geography,"Estimate; Total: - Management, business, scien...","Margin of Error; Total: - Management, business..."
1,0500000US01001,01001,"Autauga County, Alabama",48819,1806


### Or we can simply pass the column names

In [14]:
df = pd.read_csv('../data/ACS_16_5YR_B24011_with_ann.csv',
                 usecols=["GEO.id","GEO.id2","GEO.display-label"], encoding='latin-1')
df.head(2)


Unnamed: 0,GEO.id,GEO.id2,GEO.display-label
0,Id,Id2,Geography
1,0500000US01001,01001,"Autauga County, Alabama"


## Selecting Specific Sheets
+ You can also select specific sheets for Excel files! This can be done by index number with the `sheet_name` parameter; by default, the first sheet is loaded.

In [6]:
df_new = pd.read_excel("../data/Yelp_Selected_Businesses.xlsx")
df_new.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,,,,,,,,,
1,business_id,cool,date,funny,review_id,stars,text,useful,user_id
2,RESDUcs7fIiihp38-d6_6g,0,2015-09-16,0,gkcPdbblTvZDMSwx8nVEKw,5,Got here early on football Sunday 7:30am as I ...,0,SKteB5rgDlkkUa1Zxe1N0Q
3,RESDUcs7fIiihp38-d6_6g,0,2017-09-09,0,mQfl6ci46mu0xaZrkRUhlA,5,"This buffet is amazing. Yes, it is expensive,...",0,f638AHA_GoHbyDB7VFMz7A
4,RESDUcs7fIiihp38-d6_6g,0,2013-01-14,0,EJ7DJ8bm7-2PLFB9WKx4LQ,3,I was really looking forward to this but it wa...,0,-wVPuTiIEG85LwTK46Prpw


### Skipping the rows above the header

In [17]:
df_new = pd.read_excel("../data/Yelp_Selected_Businesses.xlsx", header=2)
df_new.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,RESDUcs7fIiihp38-d6_6g,0,2015-09-16,0,gkcPdbblTvZDMSwx8nVEKw,5,Got here early on football Sunday 7:30am as I ...,0,SKteB5rgDlkkUa1Zxe1N0Q
1,RESDUcs7fIiihp38-d6_6g,0,2017-09-09,0,mQfl6ci46mu0xaZrkRUhlA,5,"This buffet is amazing. Yes, it is expensive,...",0,f638AHA_GoHbyDB7VFMz7A
2,RESDUcs7fIiihp38-d6_6g,0,2013-01-14,0,EJ7DJ8bm7-2PLFB9WKx4LQ,3,I was really looking forward to this but it wa...,0,-wVPuTiIEG85LwTK46Prpw
3,RESDUcs7fIiihp38-d6_6g,0,2017-02-08,0,lMarDJDg4-e_0YoJOKJoWA,2,This place....lol our server was nice. But fo...,0,A21zMqdN76ueLZFpmbue0Q
4,RESDUcs7fIiihp38-d6_6g,0,2012-11-19,0,nq_-8lZPUVGomDEP5OOj1Q,1,"After hearing all the buzz about this place, I...",2,Jf1EXieUV7F7s-HGA4EsdA


### Or select a sheet by its index

In [17]:
df_new2 = pd.read_excel("../data/Yelp_Selected_Businesses.xlsx", sheet_name=2, header=2)
df_new2.head()

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,YJ8ljUhLsz6CtT_2ORNFmg,1,2013-04-25,0,xgUz0Ck4_ciNaeIk-H8GBQ,5,I loved this place. Easily the most hipsters p...,1,6cpo8iqgnW3jnozhmY7eAA
1,YJ8ljUhLsz6CtT_2ORNFmg,0,2014-07-07,0,Au7MG4QlAxqq9meyKSQmaw,5,So my boyfriend and I came here for my birthda...,0,8bFE3u1dMoYXkS7ORqlssw
2,YJ8ljUhLsz6CtT_2ORNFmg,0,2015-12-04,0,8IQnZ54nenXjlK-FGZ82Bg,5,I really enjoyed their food. Went there for th...,1,bJmE1ms0MyZ6KHjmfZDWGw
3,YJ8ljUhLsz6CtT_2ORNFmg,2,2016-07-06,1,XY42LMhKoXzwtLoku4mvLA,5,A complete Vegas experience. We arrived right ...,3,PbccpC-I-8rxzF2bCDh8YA
4,YJ8ljUhLsz6CtT_2ORNFmg,0,2014-04-15,0,1xlYVWhyLedoA0HddOJMOw,4,Very great atmosphere had a wonderful bartende...,0,yvlRColhqo_4TzpUFKyroA


### Or by the name of the sheet itself: 'Biz_id_RESDU'

In [19]:
df = pd.read_excel('../data/Yelp_Selected_Businesses.xlsx', sheet_name='Biz_id_RESDU', header=2)
df.head(3)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,RESDUcs7fIiihp38-d6_6g,0,2015-09-16,0,gkcPdbblTvZDMSwx8nVEKw,5,Got here early on football Sunday 7:30am as I ...,0,SKteB5rgDlkkUa1Zxe1N0Q
1,RESDUcs7fIiihp38-d6_6g,0,2017-09-09,0,mQfl6ci46mu0xaZrkRUhlA,5,"This buffet is amazing. Yes, it is expensive,...",0,f638AHA_GoHbyDB7VFMz7A
2,RESDUcs7fIiihp38-d6_6g,0,2013-01-14,0,EJ7DJ8bm7-2PLFB9WKx4LQ,3,I was really looking forward to this but it wa...,0,-wVPuTiIEG85LwTK46Prpw


## Loading a Full Workbook and Previewing Sheet Names
+ You can also load an entire Excel workbook (which is a collection of spreadsheets) with the `pd.ExcelFile()` class.

In [20]:
excelbooks = pd.ExcelFile("../data/Yelp_Selected_Businesses.xlsx")
excelbooks.sheet_names

['Biz_id_RESDU',
 'Biz_id_4JNXU',
 'Biz_id_YJ8lj',
 'Biz_id_ujHia',
 'Biz_id_na4Th']

In [21]:
df_excel = excelbooks.parse(sheet_name=1, header=2)
df_excel.head(3)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,4JNXUYY8wbaaDmk3BPzlWw,0,2012-06-10,0,wl8BO_I-is-JaMwMW5c_gQ,4,I booked a table here for brunch and it did no...,0,fo4mpUqgXL2mJqALc9AvbA
1,4JNXUYY8wbaaDmk3BPzlWw,0,2012-01-20,0,cf9RrqHY9eQ9M53OPyXLtg,4,Came here for lunch after a long night of part...,0,TVvTtXwPXsvrg2KJGoOUTg
2,4JNXUYY8wbaaDmk3BPzlWw,0,2017-05-10,0,BvmhSQ6WFm2Jxu01G8OpdQ,5,Loved the fried goat cheese in tomato sauce al...,0,etbAVunw-4kwr6VTRweZpA


## Saving Data
Once we have loaded data that we want to export, we use the **`.to_csv()`** or **`.to_excel()`** methods of any DataFrame object.

+ Notice that we have to pass `index=False` if we do not want the index included in the output.

In [35]:
df.to_csv('../data/NewSavedView.csv', index=False) 
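To see what `index=False` changes, here is a small sketch with a made-up DataFrame. Calling `to_csv()` without a path returns the CSV text, which lets us compare both versions without writing files.

```python
import io

import pandas as pd

# Small made-up DataFrame for illustration
df_small = pd.DataFrame({"Pt": [1, 2], "BP": [105, 115]})

# Default: the index is written as an extra, unnamed first column
with_index = df_small.to_csv()

# index=False: only the actual data columns are written
without_index = df_small.to_csv(index=False)

# Reading the first version back shows the index as a column named "Unnamed: 0"
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())
```

This is a common source of stray `Unnamed: 0` columns when a file is saved and loaded repeatedly.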

## <a id='7'>Saving a CSV File as a Compressed Archive</a>

In [None]:
compression_opts = dict(method='zip',
                        archive_name='Cleaned.csv')  
df.to_csv('Cleaned.zip', index=False,
          compression=compression_opts) 
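A compressed file can be read back directly. The sketch below writes a made-up DataFrame to a zip archive in a temporary directory and reloads it; the temporary path is an assumption chosen so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# Small made-up DataFrame for illustration
df_small = pd.DataFrame({"Pt": [1, 2], "BP": [105, 115]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "Cleaned.zip")

    # Write the CSV inside a zip archive, under the name Cleaned.csv
    df_small.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "Cleaned.csv"},
    )

    # read_csv infers zip compression from the .zip extension
    df_back = pd.read_csv(zip_path)
```

Passing the compression options as a dictionary requires a reasonably recent version of Pandas.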

# Importing Messy Data

In [22]:
import warnings
warnings.filterwarnings('ignore')

1. __Import__ the csv-file __cars_raw.csv__ with the appropriate pandas method and __inspect__ the data!

2. __Remove__ the __first row(s)__ containing nonsense content.

3. __Remove__ the __last row(s)__ containing nonsense content.

4. Specify that there are __no appropriate column labels/headers__ in the data.

5. __Set__ the following __column labels/headers__:

**labels = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
          'model year', 'origin', 'name']**


In [18]:
pd.read_csv("../data/cars_raw.csv")

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Welcome,to,the,cars,Dataset!
Feel,free,to,analyze,and,clean,the,messy,Dataset,!
18.0,8,307.0,130.0 hp,3504,12.0,70,United States,chevrolet chevelle malibu,
15.0,8,350.0,165.0 hp,3693,11.5,70,United States,buick skylark 320,
18.0,8,318.0,150.0 hp,3436,11.0,70,United States,plymouth satellite,
16.0,8,304.0,150.0 hp,3433,12.0,70,usa,amc rebel sst,
...,...,...,...,...,...,...,...,...,...
27.0,4,101.0,83.0 hp,2202,15.3,76,europe,renault 12tl,
17.0,6,250.0,100.0 hp,3329,15.5,71,usa,chevrolet chevelle malibu,
14.5,8,351.0,152.0 hp,4215,12.8,76,usa,ford gran torino,
25.0,6,181.0,110.0 hp,2945,16.4,82,usa,buick century limited,


Use appropriate __parameters__ in the __pd.read_csv()__ method to clean up the format. First, create the column names that will be attached to the DataFrame:

In [20]:
columns = ['mpg',"cylinders","dsiplacement","horsepower","weight","acceleration",
           "model year", "origin", "name"]

In [23]:
# skipfooter is only supported by the python parsing engine
pd.read_csv("../data/cars_raw.csv", skiprows=2, skipfooter=1, header=None, names=columns, engine="python")

Unnamed: 0,mpg,cylinders,dsiplacement,horsepower,weight,acceleration,model year,origin,name
0,18.0,8,307.0,130.0 hp,3504,12.0,70,United States,chevrolet chevelle malibu
1,15.0,8,350.0,165.0 hp,3693,11.5,70,United States,buick skylark 320
2,18.0,8,318.0,150.0 hp,3436,11.0,70,United States,plymouth satellite
3,16.0,8,304.0,150.0 hp,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0 hp,3449,10.5,70,usa,FORD TORINO
...,...,...,...,...,...,...,...,...,...
324,12.0,8,429.0,198.0 hp,4952,11.5,73,usa,mercury marquis brougham
325,27.0,4,101.0,83.0 hp,2202,15.3,76,europe,renault 12tl
326,17.0,6,250.0,100.0 hp,3329,15.5,71,usa,chevrolet chevelle malibu
327,14.5,8,351.0,152.0 hp,4215,12.8,76,usa,ford gran torino


In [24]:
Car_data = pd.read_csv("../data/cars_raw.csv", skiprows=2, skipfooter=1,
                       header=None, names=columns, engine="python")
Car_data.head(4)

Unnamed: 0,mpg,cylinders,dsiplacement,horsepower,weight,acceleration,model year,origin,name
0,18.0,8,307.0,130.0 hp,3504,12.0,70,United States,chevrolet chevelle malibu
1,15.0,8,350.0,165.0 hp,3693,11.5,70,United States,buick skylark 320
2,18.0,8,318.0,150.0 hp,3436,11.0,70,United States,plymouth satellite
3,16.0,8,304.0,150.0 hp,3433,12.0,70,usa,amc rebel sst


In [25]:
Car_data.tail(4)

Unnamed: 0,mpg,cylinders,dsiplacement,horsepower,weight,acceleration,model year,origin,name
325,27.0,4,101.0,83.0 hp,2202,15.3,76,europe,renault 12tl
326,17.0,6,250.0,100.0 hp,3329,15.5,71,usa,chevrolet chevelle malibu
327,14.5,8,351.0,152.0 hp,4215,12.8,76,usa,ford gran torino
328,25.0,6,181.0,110.0 hp,2945,16.4,82,usa,buick century limited
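The same combination of parameters can be tried on a small in-memory sample. The text below is a made-up stand-in for `cars_raw.csv` with fewer columns, so the junk lines and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up stand-in for cars_raw.csv: two junk lines on top, one at the bottom
raw_text = (
    "Welcome,to,the,cars,Dataset!\n"
    "Feel,free,to,clean,it\n"
    "18.0,8,307.0,130.0 hp,3504\n"
    "15.0,8,350.0,165.0 hp,3693\n"
    "end,of,file,,\n"
)
cols = ["mpg", "cylinders", "displacement", "horsepower", "weight"]

cars_small = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=2,        # drop the two junk lines at the top
    skipfooter=1,      # drop the junk line at the bottom
    header=None,       # the file has no usable header row
    names=cols,        # supply our own column labels
    engine="python",   # skipfooter is only supported by the python engine
)
print(cars_small)
```

Note that `horsepower` is still text because of the " hp" suffix; reading the file cleanly is only the first step of cleaning it.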


In [33]:
Car_data["origin"].plot(kind="bar")

TypeError: no numeric data to plot
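The error above occurs because `origin` holds text labels, and a bar chart needs numbers. One possible fix, sketched here with a few made-up values, is to count how often each label appears and plot the counts instead.

```python
import pandas as pd

# Made-up origin values mirroring the inconsistent labels seen above
origin = pd.Series(["United States", "usa", "usa", "europe"], name="origin")

# value_counts() turns the text labels into numeric counts per category
origin_counts = origin.value_counts()
print(origin_counts)

# With matplotlib installed, the counts can then be drawn as a bar chart:
# origin_counts.plot(kind="bar")
```

The counts also make it easy to spot labels such as "United States" and "usa" that refer to the same category.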

In [30]:
Car_data.isna().sum()

mpg             0
cylinders       0
dsiplacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
name            0
dtype: int64

6. __Export__ and __save__ cars as new csv-file (__cars_imp.csv__). Do __not__ export any __RangeIndex__!

In [40]:
Car_data.to_csv("../data/cars_imp.csv", index=False)

# Summary

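+ Great job!
+ In this session, we were introduced to the world of Pandas: importing data from CSV and Excel files, controlling the import with key arguments, and saving DataFrames back to files.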

![dance](https://media.giphy.com/media/gLbpuugfk2nILFepeI/giphy.gif)