<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

# <font color=limegreen>Acquire Data for Classification</font>

**A Few Example Methods for Reading Data into Pandas DataFrames**

<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

# Big Ideas

- Cache your data to speed up your data acquisition.

- Helper functions are your friends.

# Objectives 

**By the end of the acquire lesson and exercises, you will be able to...**

- **read data into a pandas DataFrame using the following modules:**

>pydataset
    
```python
from pydataset import data
df = data('dataset_name')
```
>seaborn datasets
    
```python
import seaborn as sns
df = sns.load_dataset('dataset_name')
```

- **read data into a pandas DataFrame from the following sources:**

    - an Excel spreadsheet

    - a Google sheet
    
    - Codeup's MySQL database

```python
pd.read_excel('file_name.xlsx', sheet_name='sheet_name')
pd.read_csv('filename.csv')
pd.read_sql(sql_query, connection_url)
```

- **use pandas methods and attributes to do some initial summarization and exploration of your data.**

```python
.head()
.shape
.info()
.columns
.dtypes
.describe()
.value_counts()
```

- **create functions that acquire data from Codeup's database, save the data locally to CSV files (cache your data), and check for CSV files upon subsequent use.**


- **create a new python module, `acquire.py`, that holds your functions that acquire the titanic and iris data and can be imported and called in other notebooks and scripts.**
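The summarization methods and attributes listed in the objectives can be tried without any database access. The sketch below uses a tiny hand-built DataFrame; the values are made up for illustration and are not a real dataset.

```python
import pandas as pd

# A tiny, made-up DataFrame (illustrative values only).
practice_df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'versicolor'],
    'petal_length': [1.4, 1.3, 5.1, 4.5],
    'petal_width': [0.2, 0.2, 1.8, 1.5],
})

print(practice_df.head(2))            # first two rows
print(practice_df.shape)              # (rows, columns) tuple
practice_df.info()                    # non-null counts and dtypes (prints directly)
print(practice_df.columns.tolist())   # column labels as a list
print(practice_df.dtypes)             # dtype of each column
print(practice_df.describe())         # summary statistics for numeric columns
print(practice_df.species.value_counts())  # frequency of each unique value
```

Note that `.shape`, `.columns`, and `.dtypes` are attributes (no parentheses), while the rest are methods.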

<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

In [2]:
import pandas as pd
import numpy as np
import os

# visualize
import seaborn as sns
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(11, 9))
plt.rc('font', size=13)

# turn off pink warning boxes
import warnings
warnings.filterwarnings("ignore")

# acquire
from env import host, user, password
from pydataset import data

<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

### From a Database

Create your DataFrame using a SQL query to access a database.

**<font color=purple>Use the credentials in your env file to build a `connection_url`, write your SQL query, and pass both to the pandas `read_sql()` function.</font>**

```python
# Import private info to keep it secret in public files.
from env import host, password, user

# Test query in Sequel Pro and save to a variable.
sql_query = 'write your sql query here; test it in Sequel Pro first!'

# Save connection url to a variable for use with pandas `read_sql()` function.
connection_url = f'mysql+pymysql://{user}:{password}@{host}/database_name'
    
# Python function to read data from database into a DataFrame.
pd.read_sql(sql_query, connection_url)
```
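One thing to watch for when building a connection url with an f-string: if a password contains characters such as `@` or `/`, they can be misread as parts of the url. A minimal sketch of one way to guard against that is to percent-encode the password with the standard library's `quote_plus`. The credentials below are hypothetical placeholders, not real values from an env file.

```python
from urllib.parse import quote_plus

# Hypothetical placeholder credentials; real values would come from env.py.
user = 'example_user'
password = 'p@ss/word!'
host = 'db.example.com'

# Percent-encode the password so '@' and '/' are not mistaken for url delimiters.
safe_password = quote_plus(password)

connection_url = f'mysql+pymysql://{user}:{safe_password}@{host}/database_name'
print(connection_url)
```

If your password is made up only of letters and digits, the encoded and raw versions are identical, so the extra step does no harm.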

In [3]:
# Create sql query and save to variable.

sql_query = 'SELECT * FROM passengers'

In [4]:
# Create connection url and save to a variable.

connection_url = f'mysql+pymysql://{user}:{password}@{host}/titanic_db'

In [5]:
# Use my variables in the pandas read_sql() function.

titanic_df = pd.read_sql(sql_query, connection_url)
titanic_df.head(3)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1


In [6]:
titanic_df.shape

(891, 13)
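If you want to practice the `read_sql()` pattern without access to the Codeup database, a rough stand-in is an in-memory SQLite database from the standard library. This is only an assumption for practice: the table below is a three-row, three-column imitation of `passengers`, and a `sqlite3` connection object is passed in place of the MySQL connection url.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for titanic_db (practice only).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE passengers (passenger_id INTEGER, survived INTEGER, sex TEXT)'
)
conn.executemany(
    'INSERT INTO passengers VALUES (?, ?, ?)',
    [(0, 0, 'male'), (1, 1, 'female'), (2, 1, 'female')],
)
conn.commit()

# Same query and same pandas function as above; only the connection differs.
sql_query = 'SELECT * FROM passengers'
sqlite_df = pd.read_sql(sql_query, conn)
conn.close()

print(sqlite_df.shape)
```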

___

### From Files

- Create your DataFrame from a csv file.

```python
df = pd.read_csv('file_path/file_name.csv')
```
- Create your DataFrame from a csv file hosted on AWS (Amazon Web Services) S3.

```python
df = pd.read_csv('https://s3.amazonaws.com/bucket_and_or_file_name.csv')
```

- Create your DataFrame from a Google sheet using its Share url.

```python
sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'
```  

```python
csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
```

```python
df = pd.read_csv(csv_export_url)
```

In [7]:
# Assign our Google Sheet share url to a variable. (make sure share settings are appropriate)

sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'

In [8]:
# Use the replace method to modify our Google Sheet share url to be a csv export url.

csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')

In [9]:
# Use read_csv() method to create our DataFrame.

df_googlesheet = pd.read_csv(csv_export_url)
df_googlesheet.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
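One caveat with the `replace()` approach: if the share url does not contain the text `/edit#gid=`, `replace()` quietly returns the url unchanged and `read_csv()` fails later with a less obvious error. A minimal sketch of a helper that fails early instead is shown here; the name `to_csv_export_url` is hypothetical, not part of pandas or any Google library.

```python
def to_csv_export_url(share_url):
    """Turn a Google Sheet share url into a csv export url (hypothetical helper)."""
    marker = '/edit#gid='
    if marker not in share_url:
        # replace() would silently do nothing, so raise a clear error instead.
        raise ValueError(f'Expected {marker!r} in the share url.')
    return share_url.replace(marker, '/export?format=csv&gid=')


sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'
export_url = to_csv_export_url(sheet_url)
print(export_url)
```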


___

### From Your Clipboard

Read copy-pasted tabular data and parse it into a DataFrame.

```python
# Default: whitespace-delimited (sep=r'\s+'). Extra keyword arguments are passed on to read_csv().
df = pd.read_clipboard()

# Some examples of the available options.
columns = ['column_1', 'column_2', 'column_3']
df = pd.read_clipboard(sep=',', header=None, names=columns)
```

[Here's](https://towardsdatascience.com/pandas-hacks-read-clipboard-94a05c031382) a short and sweet article that explains it all nicely.

In [11]:
# Try out the read_clipboard() method here using the article.

pd.read_clipboard()

Unnamed: 0,0,1,2,3
0,0.850004,0.206778,0.6552,0.079339
1,0.948567,0.749701,0.116241,0.069551
2,0.834722,0.360724,0.410327,0.535236
3,0.221309,0.916424,0.649175,0.80375


In [12]:
# Try out the read_clipboard() method with data without headers/column names.

pd.read_clipboard()

Unnamed: 0,Unnamed: 1,Unnamed: 2,"1,0,3,""Braund,",Mr.,Owen,"Harris"",male,22,1,0,A/5","21171,7.25,,S"
"2,1,1,""Cumings,",Mrs.,John,Bradley,(Florence,Briggs,"Thayer)"",female,38,1,0,PC","17599,71.2833,C85,C"
"3,1,3,""Heikkinen,",Miss.,"Laina"",female,26,0,0,STON/O2.","3101282,7.925,,S",,,,
"4,1,1,""Futrelle,",Mrs.,Jacques,Heath,(Lily,May,"Peel)"",female,35,1,0,113803,53.1,C123,S",
"5,0,3,""Allen,",Mr.,William,"Henry"",male,35,0,0,373450,8.05,,S",,,,
"6,0,3,""Moran,",Mr.,"James"",male,,0,0,330877,8.4583,,Q",,,,,
"7,0,1,""McCarthy,",Mr.,Timothy,"J"",male,54,0,0,17463,51.8625,E46,S",,,,
"8,0,3,""Palsson,",Master.,Gosta,"Leonard"",male,2,3,1,349909,21.075,,S",,,,
"9,1,3,""Johnson,",Mrs.,Oscar,W,(Elisabeth,Vilhelmina,"Berg)"",female,27,0,2,347742,11.1333,,S",
"10,1,2,""Nasser,",Mrs.,Nicholas,(Adele,"Achem)"",female,14,1,0,237736,30.0708,,C",,,
"11,1,3,""Sandstrom,",Miss.,Marguerite,"Rut"",female,4,1,1,PP","9549,16.7,G6,S",,,


In [13]:
# The output above is messy because the default separator is whitespace.
# Clean it up by passing a comma separator and explicit column names:
columns = [
    'PassengerId', 'Survived', 'Pclass', 'Name',
    'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
    'Cabin', 'Embarked']  # column names taken from the linked article

df = pd.read_clipboard(sep=',', header=None, names=columns)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
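Clipboard contents are hard to reproduce from one run to the next. Since `read_clipboard()` hands the copied text to `read_csv()`, the same options can be tried on a plain string. The sketch below assumes two of the rows above were copied as raw comma-separated text and uses `io.StringIO` in place of the clipboard.

```python
import io
import pandas as pd

# Two header-less rows, as they might look when copied as raw text.
raw_text = '''1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
'''

columns = [
    'PassengerId', 'Survived', 'Pclass', 'Name',
    'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
    'Cabin', 'Embarked']

# Same keyword arguments as the read_clipboard() call above.
parsed_df = pd.read_csv(io.StringIO(raw_text), sep=',', header=None, names=columns)
print(parsed_df[['PassengerId', 'Name', 'Cabin']])
```

The quoted names keep their internal commas because the comma separator respects the double quotes; the whitespace default does not, which is why the earlier output split names across columns.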


___

### From an Excel Sheet

```python
pd.read_excel('your_excel_file_name.xlsx', sheet_name='your_table_name', usecols=['this_one', 'this_one'])
```

In [32]:
# Read in one sheet from my_telco_churn excel workbook.

customers_df = pd.read_excel('my_telco_churn.xlsx', sheet_name='Table2_CustDetails')
customers_df.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'Copy of Cust_Churn_Telco.xlsx'

___

### From Pydataset

Create your DataFrame using pydataset, and read the dataset's documentation with `show_doc=True`.

```python
from pydataset import data

data('iris', show_doc=True)

df_iris = data('iris')
```

In [15]:
# Create DataFrame using pydataset 'iris'

df_iris = data('iris')
df_iris.head(3)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa


In [16]:
# Using Seaborn Datasets. This one has nice column names! :)

iris = sns.load_dataset('iris')
iris.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
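If you prefer the pydataset version but want column names like the seaborn ones, one possible approach is to lowercase the names and swap the dots for underscores. The sketch below rebuilds the three pydataset rows shown above by hand so that it runs without pydataset installed.

```python
import pandas as pd

# Hand-built copy of the first three pydataset iris rows shown above.
iris_renamed = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, 'setosa'],
     [4.9, 3.0, 1.4, 0.2, 'setosa'],
     [4.7, 3.2, 1.3, 0.2, 'setosa']],
    columns=['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'],
    index=[1, 2, 3],
)

# Lowercase every name and replace the literal '.' with '_'.
iris_renamed.columns = (
    iris_renamed.columns.str.lower().str.replace('.', '_', regex=False)
)
print(iris_renamed.columns.tolist())
```

Passing `regex=False` matters here because `.` would otherwise be treated as a regular expression that matches any character.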


<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

## Automating Data Acquisition

-  The process of acquiring, preparing, exploring, modeling, and evaluating data is called the Data Science Pipeline.


- As we go through the pipeline, our goal is to end each stage with functions that automate the process and can feed into the next stage, making our work faster and, more importantly, repeatable.


- We store our functions from each stage in modules, `acquire.py`, `prepare.py`, etc., and import them for use in our notebooks. All of the helper and main functions are stored in the `.py` file or module to keep our notebook clean and readable.


- Ideally, upon completing the entire process, we should be able to use all of our functions, from each stage, to create one pipeline function that can reproduce our entire process from acquisition to evaluation.


- If our goal is to acquire the titanic data from the Codeup database, both of the functions below would be stored in an `acquire.py` file and imported into our notebook for use.

<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

**<font color=purple>Put it all together in a single function that acquires new data from the Codeup database and save it, as well as any helper functions, in your `acquire.py` file.</font>**

```python
# Create helper function to get the necessary connection url.
def get_connection(db, user=user, host=host, password=password):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

# Use the above helper function and a sql query in a single function.
def get_db_data():
    '''
    This function reads data from the Codeup db into a df.
    '''
    sql_query = 'write your sql query here; test it in Sequel Pro first!'
    return pd.read_sql(sql_query, get_connection('database_name'))
```

In [17]:
# Let's create a helper function that creates our connection url.

def get_connection(db, user=user, host=host, password=password):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''  # Triple quotes create a docstring that says what the function does.
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

In [18]:
def new_titanic_data():
    '''
    This function reads in the titanic data from the Codeup db
    and returns a pandas DataFrame with all columns.
    '''
    sql_query = 'SELECT * FROM passengers'
    return pd.read_sql(sql_query, get_connection('titanic_db'))

In [20]:
df = new_titanic_data()
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

## Caching Data

**<font color=green>Save time by saving your data to a csv file for future use.</font>**

- Caching or storing data you've retrieved from a database or website makes accessing it later much faster. Basically, cached data reduces load times.

- We can design our acquire functions to get our data for us faster by reading in a csv file, if one exists, and if not, acquiring our data and creating a csv file for later use.

- The `os.path.isfile()` function checks whether a given path points to an existing file. It returns a boolean value.

<hr style="border-top: 10px groove limegreen; margin-top: 1px; margin-bottom: 1px"></hr>

In [21]:
# Let's check to see if a file named 'titanic_df.csv' exists in this directory.

os.path.isfile('titanic_df.csv')

False

In [22]:
# Let's write our 'titanic_df' DataFrame to a csv file.

titanic_df.to_csv('titanic_df.csv') #creates the csv file

In [23]:
# Let's check again...

os.path.isfile('titanic_df.csv')

True

- Let's use this concept to write a new function that allows us to hit the Codeup database, write the data to a csv file for later use, and read the data into a pandas DataFrame the next time we call the function and the csv file exists.

In [24]:
# Here is our first helper function that's used below.

def get_connection(db, user=user, host=host, password=password):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

In [25]:
# Let's use our new_titanic_data() function from above as a helper in a final function.

def new_titanic_data():
    '''
    This function reads the titanic data from the Codeup db into a df
    and returns the df.
    '''
    # Create SQL query.
    sql_query = 'SELECT * FROM passengers'
    
    # Read in DataFrame from Codeup db.
    df = pd.read_sql(sql_query, get_connection('titanic_db'))
    
    return df

In [26]:
def get_titanic_data(cached=False):
    '''
    This function reads the titanic data from the Codeup database and writes it
    to a csv file if cached == False or if no csv file exists yet. If cached == True
    and the csv file exists, it reads the titanic df from the csv file instead.
    Returns the df.
    '''
    if not cached or not os.path.isfile('titanic_df.csv'):

        # Read fresh data from db into a DataFrame.
        df = new_titanic_data()

        # Write DataFrame to a csv file.
        df.to_csv('titanic_df.csv')

    else:

        # If the csv file exists and cached == True, read in data from csv.
        df = pd.read_csv('titanic_df.csv', index_col=0)

    return df

In [27]:
df = get_titanic_data()
df.head(2)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0


In [28]:
df = get_titanic_data(cached=False)
df.head(2)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0


In [29]:
df = get_titanic_data(cached=True)
df.head(2)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
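All three calls above return the same rows, so it is hard to see which branch actually ran. The sketch below mirrors the branching in `get_titanic_data()` but swaps the database for a stand-in function that counts its own calls and writes to a temporary directory. The names `fake_new_data` and `get_data` are hypothetical, used only for this illustration.

```python
import os
import tempfile
import pandas as pd

calls = {'db': 0}  # counts how often the stand-in "database" is queried


def fake_new_data():
    """Stand-in for new_titanic_data(); no real database is involved."""
    calls['db'] += 1
    return pd.DataFrame({'passenger_id': [0, 1], 'survived': [0, 1]})


def get_data(csv_path, cached=False):
    """Same branching as get_titanic_data(), with the csv path as a parameter."""
    if not cached or not os.path.isfile(csv_path):
        df = fake_new_data()
        df.to_csv(csv_path)
    else:
        df = pd.read_csv(csv_path, index_col=0)
    return df


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'practice.csv')
    first = get_data(path, cached=True)    # no file yet, so the source is queried
    second = get_data(path, cached=True)   # file exists, so it is read from disk
    third = get_data(path, cached=False)   # cache bypassed, source queried again

print(calls['db'])
```

Only the first and third calls reach the stand-in source. Because `cached` defaults to `False`, a bare call always goes back to the source, so pass `cached=True` when you want the speed benefit.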


In [31]:
# Why take the time to write docstrings into my functions?

get_titanic_data()
# Press Shift+Tab with the cursor between the () to see the docstring.

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.9250,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1000,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.0500,S,Third,,Southampton,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,,Southampton,1
887,887,1,1,female,19.0,0,0,30.0000,S,First,B,Southampton,1
888,888,0,3,female,,1,2,23.4500,S,Third,,Southampton,0
889,889,1,1,male,26.0,0,0,30.0000,C,First,C,Cherbourg,1
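Outside of a notebook, the same docstring can be read programmatically. The sketch below redefines the helper with hypothetical placeholder defaults so that it runs without an env file, then prints the docstring with the standard library's `inspect.getdoc()`.

```python
import inspect


def get_connection(db, user='example_user', host='db.example.com', password='secret'):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''
    # Placeholder defaults above are hypothetical; real values come from env.py.
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'


# getdoc() returns the docstring with its indentation cleaned up.
doc_text = inspect.getdoc(get_connection)
print(doc_text)
```

Calling `help(get_connection)` in a Python session shows the same text along with the function signature.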
