# Exercises

### 1 - 3. Set up repo and .gitignore

#### 1. Make a new repo called classification-exercises on both GitHub and within your codeup-data-science directory. This will be where you do your work for this module.

#### 2. Inside of your local classification-exercises repo, create a file named .gitignore with the following contents:

```
env.py
.DS_Store
.ipynb_checkpoints/
__pycache__
*.csv
```

#### Add and commit your .gitignore file before moving forward.

#### 3. Now that you are 100% sure that your .gitignore file lists env.py, create or copy your env.py file inside of classification-exercises. Running git status should show that git is ignoring this file.

### 4. In a Jupyter notebook, classification_exercises.ipynb, use a Python module (pydataset or seaborn datasets) containing datasets as a source for the iris data. Create a pandas dataframe, df_iris, from this data.

In [1]:
import pandas as pd
from pydataset import data

iris = data('iris')

In [2]:
# 4a. print the first 3 rows

print(iris.head(3))

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa
2           4.9          3.0           1.4          0.2  setosa
3           4.7          3.2           1.3          0.2  setosa


In [5]:
# 4b. print the number of rows and columns (shape)

print(iris.shape)

(150, 5)


In [6]:
# 4c. print the column names

print(iris.columns)

Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')


In [12]:
# 4d. print the data type of each column

print(iris.dtypes)

Sepal.Length    float64
Sepal.Width     float64
Petal.Length    float64
Petal.Width     float64
Species          object
dtype: object


In [13]:
# 4e. print the summary statistics for each of the numeric variables

print(iris.describe())

       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


### 5. Read the data from this google sheet into a dataframe, df_google.

In [15]:
url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'

csv_export_url = url.replace('/edit#gid=', '/export?format=csv&gid=')

df_google = pd.read_csv(csv_export_url)
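
The `replace` call above works because a Google Sheets edit URL and its CSV export URL differ only in the segment around `gid`. A minimal sketch of the same conversion as a reusable function follows; the name `sheet_csv_export_url` and the `SHEET_ID` placeholder are illustrative assumptions, not part of the exercise.

```python
def sheet_csv_export_url(edit_url: str) -> str:
    """Turn a Google Sheets '/edit#gid=...' URL into a CSV export URL."""
    marker = "/edit#gid="
    if marker not in edit_url:
        # Fail loudly instead of handing pandas a URL that returns HTML
        raise ValueError(f"expected {marker!r} in the URL")
    return edit_url.replace(marker, "/export?format=csv&gid=")


# SHEET_ID is a placeholder, not a real spreadsheet
example = "https://docs.google.com/spreadsheets/d/SHEET_ID/edit#gid=0"
print(sheet_csv_export_url(example))
# prints https://docs.google.com/spreadsheets/d/SHEET_ID/export?format=csv&gid=0
```

The result can be passed straight to `pd.read_csv`, provided the sheet is shared so that anyone with the link can view it.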

In [16]:
# 5a. print the first 3 rows

print(df_google.head(3))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  


In [18]:
# 5b. print the number of rows and columns

print(df_google.shape)

(891, 12)


In [19]:
# 5c. print the column names

print(df_google.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


In [20]:
# 5d. print the data type of each column

print(df_google.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [21]:
# 5e. print the summary statistics for each of the numeric variables

# note the parentheses: describe is a method and must be called to get the statistics
print(df_google.describe())

In [23]:
# 5f. print the unique values for each of your categorical variables

print(df_google.describe(include='object'))

                           Name   Sex  Ticket    Cabin Embarked
count                       891   891     891      204      889
unique                      891     2     681      147        3
top     Braund, Mr. Owen Harris  male  347082  B96 B98        S
freq                          1   577       7        4      644
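
`describe(include='object')` reports how many distinct values each column holds, not the values themselves. One possible way to list them is sketched below on a small made-up frame standing in for `df_google`; the helper name `unique_values` and the `max_unique` cutoff are assumptions made for illustration.

```python
import pandas as pd

# Small made-up frame standing in for df_google (values are illustrative only)
sample = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Embarked": ["S", "C", "S", None],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)


def unique_values(df: pd.DataFrame, max_unique: int = 10) -> dict:
    """Map each non-numeric column to its distinct non-null values.

    Columns with more than max_unique distinct values (names, ticket
    numbers) are skipped because listing them is rarely useful.
    """
    result = {}
    for col in df.select_dtypes(exclude="number").columns:
        values = df[col].dropna().unique().tolist()
        if len(values) <= max_unique:
            result[col] = values
    return result


uniques = unique_values(sample)
for col, values in uniques.items():
    print(f"{col}: {values}")
```

On the real data, a cutoff like this would keep `Sex` and `Embarked` and skip `Name`, `Ticket`, and `Cabin`, which have hundreds of distinct values each.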


### 6. Download the previous exercise's file as an Excel workbook (File → Download → Microsoft Excel). Read the downloaded file into a dataframe named df_excel.

In [24]:
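# Read the downloaded workbook (pd.read_excel needs an Excel engine such as openpyxl installed)
df_excel = pd.read_excel('train.xlsx')

# 6a. assign the first 100 rows to a new dataframe, df_excel_sample
df_excel_sample = df_excel.head(100)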

In [26]:
# 6b. print the number of rows of your original dataframe

print(df_excel.shape[0])

891


In [27]:
# 6c. print the first 5 column names

print(df_excel.columns[:5])

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex'], dtype='object')


In [32]:
# 6d. print the column names that have a data type of object

print(df_excel.select_dtypes(include='object').columns)

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')


In [33]:
# 6e. compute the range for each of the numeric variables.

for col in df_excel.select_dtypes(include='number'):
    col_range = df_excel[col].max() - df_excel[col].min()
    print(f"Range for {col}: {col_range}")

Range for PassengerId: 890
Range for Survived: 1
Range for Pclass: 2
Range for Age: 79.58
Range for SibSp: 8
Range for Parch: 6
Range for Fare: 512.3292
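
The loop above computes one range at a time. The same result can be obtained in a single vectorized step, since `max()` and `min()` on a dataframe return one value per column. This is a sketch on a small made-up frame, not the actual Titanic data.

```python
import pandas as pd

# Made-up rows standing in for df_excel (values are illustrative only)
sample = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0],
        "Fare": [7.25, 71.2833, 7.925],
        "Sex": ["male", "female", "female"],
    }
)

# Keep only the numeric columns, then subtract column-wise minima from maxima
numeric = sample.select_dtypes(include="number")
ranges = numeric.max() - numeric.min()

print(ranges)
```

`ranges` is a Series indexed by column name, so a single value can be looked up with, for example, `ranges["Age"]`.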


### 7. Make a new Python module, acquire.py, to hold the following data acquisition functions:

In [1]:
import acquire as a

# Make a function named get_titanic_data that returns the titanic data from the codeup data science database as a pandas data frame. 
# Obtain your data from the Codeup Data Science Database.
titanic = a.get_titanic_data()

# Make a function named get_iris_data that returns the data from the iris_db on the codeup data science database as a pandas data frame. 
# The returned data frame should include the actual name of the species in addition to the species_ids. 
# Obtain your data from the Codeup Data Science Database.
iris = a.get_iris_data()

# Make a function named get_telco_data that returns the data from the telco_churn database in SQL. 
# In your SQL, be sure to join contract_types, internet_service_types, payment_types tables with
# the customers table, so that the resulting dataframe contains all the contract, payment, and 
# internet service options. Obtain your data from the Codeup Data Science Database.
telco = a.get_telco_data()

# Once you've got your get_titanic_data, get_iris_data, and get_telco_data functions written, 
# now it's time to add caching to them. To do this, edit the beginning of the function to check 
# for the local filename of telco.csv, titanic.csv, or iris.csv. If they exist, use the .csv file. 
# If the file doesn't exist, then produce the SQL and pandas necessary to create a dataframe,
# then write the dataframe to a .csv file with the appropriate name.
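
The join described for get_telco_data can be sketched against an in-memory SQLite database. Every table layout, key column name, and row below is an assumption made for illustration; the real telco_churn schema on the Codeup server may differ, so check it before reusing the query.

```python
import sqlite3

# Hypothetical schema and rows, loosely modelled on the exercise description
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contract_types (contract_type_id INTEGER PRIMARY KEY, contract_type TEXT);
    CREATE TABLE internet_service_types (internet_service_type_id INTEGER PRIMARY KEY, internet_service_type TEXT);
    CREATE TABLE payment_types (payment_type_id INTEGER PRIMARY KEY, payment_type TEXT);
    CREATE TABLE customers (
        customer_id TEXT PRIMARY KEY,
        contract_type_id INTEGER,
        internet_service_type_id INTEGER,
        payment_type_id INTEGER
    );
    INSERT INTO contract_types VALUES (1, 'Month-to-month'), (2, 'One year');
    INSERT INTO internet_service_types VALUES (1, 'DSL'), (2, 'Fiber optic');
    INSERT INTO payment_types VALUES (1, 'Electronic check'), (2, 'Mailed check');
    INSERT INTO customers VALUES ('A-001', 1, 2, 1), ('B-002', 2, 1, 2);
    """
)

# Join each lookup table to customers on its (assumed) id column
TELCO_QUERY = """
    SELECT c.customer_id, ct.contract_type, ist.internet_service_type, pt.payment_type
    FROM customers AS c
    JOIN contract_types AS ct ON ct.contract_type_id = c.contract_type_id
    JOIN internet_service_types AS ist ON ist.internet_service_type_id = c.internet_service_type_id
    JOIN payment_types AS pt ON pt.payment_type_id = c.payment_type_id
    ORDER BY c.customer_id
"""

rows = conn.execute(TELCO_QUERY).fetchall()
for row in rows:
    print(row)
conn.close()
```

In acquire.py the same query text would be passed to `pd.read_sql` instead of being run through `sqlite3`.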

# Notes

In [3]:
import pandas as pd

In your data acquisition function, first check to see if the csv file exists. If it does, read from the csv file, otherwise get the data "fresh".

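```python
import os

import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: env.py defines get_connection(db_name), which returns the
# SQLAlchemy connection URL for the named database.
from env import get_connection


def get_titanic_data():
    filename = "titanic.csv"

    if os.path.isfile(filename):
        # Use the cached copy when it exists
        return pd.read_csv(filename)
    else:
        # Create the engine
        engine = create_engine(get_connection('titanic_db'))

        # Read the SQL query into a dataframe, closing the connection afterwards
        with engine.connect() as conn:
            df = pd.read_sql(text('SELECT * FROM passengers'), conn)

        # Write that dataframe to disk for later ("caching" the data).
        # index=False keeps the row index from coming back as an extra column.
        df.to_csv(filename, index=False)

        # Return the dataframe to the calling code
        return df
```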
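
Since all three acquisition functions follow the same check-then-cache pattern, the shared part could be factored into one helper. The sketch below is one possible shape, not the course's reference solution: `cached_dataframe` and `fake_query` are hypothetical names, and `fake_query` stands in for the real database call so the example runs without a server.

```python
import os
import tempfile

import pandas as pd


def cached_dataframe(filename, fetch):
    """Return the CSV at filename if it exists; otherwise call fetch() and cache the result."""
    if os.path.isfile(filename):
        return pd.read_csv(filename)
    df = fetch()
    df.to_csv(filename, index=False)
    return df


# Stand-in for a database query; records how often it is actually called
calls = []


def fake_query():
    calls.append(1)
    return pd.DataFrame({"passenger_id": [1, 2], "survived": [0, 1]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titanic.csv")
    first = cached_dataframe(path, fake_query)   # runs fake_query and writes the CSV
    second = cached_dataframe(path, fake_query)  # reads the CSV, no query

print(len(calls))  # prints 1
```

With a helper like this, each function in acquire.py reduces to a filename plus a small function that runs its SQL query.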