In [1]:
import pandas as pd
import numpy as np
from pydataset import data

import env
import util
import acquire


### Classification: Acquire data


#### Goals

-The data you wish to analyze may be stored in a variety of sources. In this lesson, we will review importing data from a CSV file and from MySQL, and we will also learn how to import data from the local clipboard, a Google Sheets document, and an MS Excel file. We will then select one source to use as we continue through the rest of this module.

#### Methods of Data Acquisition

-read_clipboard: When you have data copied to your clipboard, you can read it into a data frame with pd.read_clipboard. This is useful for quickly pulling data out of a spreadsheet; DataFrame.to_clipboard copies data in the other direction.

-read_excel: This function can be used to create a data frame based on the contents of an Excel spreadsheet.

-read_csv: Read from a local CSV file, or from the cloud (for example, Google Sheets or AWS S3).

-read_sql(sql_query, connection_url): Read data by running a SQL query against a database. The required drivers must be installed, and you must provide a specially formatted URL string.

    # To talk to a MySQL database, install the drivers:
    python -m pip install pymysql mysql-connector
    # The connection URL string:
    mysql+pymysql://USER:PASSWORD@HOST/DATABASE_NAME
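
As a minimal sketch of how the URL string above might be assembled, the helper below fills in the template. The function name `get_connection_url` is hypothetical, and the credentials are placeholders; in this notebook they would presumably come from the `env` module imported at the top. Percent-encoding the user and password is an added precaution for credentials containing characters such as `@`, and is not part of the original template.

```python
from urllib.parse import quote_plus


def get_connection_url(user, password, host, database):
    """Build a connection URL for a MySQL database using the pymysql driver.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Percent-encode the credentials so characters such as '@' or '/'
    # do not break the URL structure (an assumption, not in the template).
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# Placeholder values; real credentials belong in a private file such as env.py.
url = get_connection_url("USER", "PASSWORD", "HOST", "DATABASE_NAME")
print(url)
```

The resulting string can then be passed as the second argument to `pd.read_sql`, which requires SQLAlchemy and pymysql to be installed.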


1. Use a Python module containing datasets as a source for the iris data. Create a pandas dataframe, df_iris, from this data.

    -print the first 3 rows

    -print the number of rows and columns (shape)

    -print the column names

    -print the data type of each column

    -print the summary statistics for each of the numeric variables. Would you recommend rescaling the data based on these statistics?

In [2]:
df_iris = data("iris")

In [3]:
df_iris.head(3)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa


In [4]:
df_iris.columns

Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')

In [5]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
Sepal.Length    150 non-null float64
Sepal.Width     150 non-null float64
Petal.Length    150 non-null float64
Petal.Width     150 non-null float64
Species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB


In [6]:
df_iris.describe()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
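
One way to approach the rescaling question is to compare the spread of each column. The sketch below recomputes the range from the min and max rows of the summary above, so the numbers are copied from that output rather than read from the dataset; the `stats` frame is an illustrative construction.

```python
import pandas as pd

# Minimum and maximum values copied from the describe() output above.
stats = pd.DataFrame(
    {
        "min": [4.3, 2.0, 1.0, 0.1],
        "max": [7.9, 4.4, 6.9, 2.5],
    },
    index=["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
)

# Range (max - min) of each measurement.
stats["range"] = stats["max"] - stats["min"]
print(stats)

# Ratio between the widest and narrowest range; a value close to 1 means
# the features already live on similar scales.
print(stats["range"].max() / stats["range"].min())
```

All four ranges fall within the same order of magnitude, so rescaling is arguably optional here, although distance-based models may still benefit from it.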


2. Read the Table1_CustDetails sheet from the Excel module dataset, Excel_Exercises.xlsx, into a dataframe, df_excel

    -assign the first 100 rows to a new dataframe, df_excel_sample

    -print the number of rows of your original dataframe

    -print the first 5 column names

    -print the column names that have a data type of object

    -compute the range for each of the numeric variables.



In [7]:
df = pd.read_excel("my_telco_churn.xlsx", sheet_name="Table1_CustDetails")

In [8]:
df_excel_sample = df.head(100)

In [9]:
df.shape[0]

7049

In [10]:
list(df.columns)[:5]

['customer_id', 'gender', 'is_senior_citizen', 'partner', 'dependents']

In [11]:
df.dtypes

customer_id           object
gender                object
is_senior_citizen      int64
partner               object
dependents            object
phone_service          int64
internet_service       int64
contract_type          int64
payment_type          object
monthly_charges      float64
total_charges        float64
tenure               float64
churn                 object
Unnamed: 13          float64
phone_service.1       object
(Multiple Items)      object
dtype: object

In [38]:
df.describe()

Unnamed: 0,is_senior_citizen,phone_service,internet_service,contract_type,monthly_charges,total_charges,tenure,Unnamed: 13
count,7049.0,7049.0,7049.0,7049.0,7049.0,7038.0,7049.0,0.0
mean,0.162009,1.324585,1.222585,0.690878,64.747014,2283.043883,32.379866,
std,0.368485,0.642709,0.779068,0.833757,30.09946,2266.521984,24.595524,
min,0.0,0.0,0.0,0.0,18.25,18.8,0.0,
25%,0.0,1.0,1.0,0.0,35.45,401.5875,8.733456,
50%,0.0,1.0,1.0,0.0,70.35,1397.1,28.683425,
75%,0.0,2.0,2.0,1.0,89.85,3793.775,55.229399,
max,1.0,2.0,2.0,2.0,118.75,8684.8,79.341772,


In [12]:
df.dtypes[df.dtypes == "object"]

customer_id         object
gender              object
partner             object
dependents          object
payment_type        object
churn               object
phone_service.1     object
(Multiple Items)    object
dtype: object

In [43]:
df.dtypes[df.dtypes != "object"]

is_senior_citizen      int64
phone_service          int64
internet_service       int64
contract_type          int64
monthly_charges      float64
total_charges        float64
tenure               float64
Unnamed: 13          float64
dtype: object
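
The exercise also asks for the range of each numeric variable, which the cells above do not compute directly. A minimal sketch, using a small hypothetical frame `df_demo` in place of the real customer data:

```python
import pandas as pd

# Hypothetical miniature of the customer data, for illustration only.
df_demo = pd.DataFrame(
    {
        "customer_id": ["a", "b", "c"],
        "monthly_charges": [20.0, 70.0, 110.0],
        "tenure": [1.0, 30.0, 72.0],
    }
)

# Keep only the numeric columns, then subtract each column's minimum
# from its maximum to get the range.
numeric = df_demo.select_dtypes(include="number")
ranges = numeric.max() - numeric.min()
print(ranges)
```

The same two lines should work on the notebook's `df`; a column that is entirely missing, such as `Unnamed: 13`, would yield NaN.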

3. Read the data from this google sheet into a dataframe, df_google

    -print the first 3 rows

    -print the number of rows and columns

    -print the column names

    -print the data type of each column

    -print the summary statistics for each of the numeric variables

    -print the unique values for each of your categorical variables

#### Testing new funcs

In [14]:
sheet_url = "https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357"

csv_export_url = sheet_url.replace("/edit#gid=", '/export?format=csv&gid=')

df_google = pd.read_csv(csv_export_url)
df_google.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
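
The string replacement used above can be wrapped in a small function so it is reusable for other sheets. This is a sketch only: the name `sheet_to_csv_url` is hypothetical, and the check for the `/edit#gid=` marker is an added safeguard.

```python
def sheet_to_csv_url(sheet_url):
    """Convert a Google Sheets edit URL into its CSV export URL.

    Hypothetical helper wrapping the string replacement shown above.
    """
    marker = "/edit#gid="
    if marker not in sheet_url:
        raise ValueError("expected a URL containing '/edit#gid='")
    return sheet_url.replace(marker, "/export?format=csv&gid=")


sheet_url = (
    "https://docs.google.com/spreadsheets/d/"
    "1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357"
)
csv_export_url = sheet_to_csv_url(sheet_url)
print(csv_export_url)
```

Passing `csv_export_url` to `pd.read_csv` generally only works when the sheet is viewable by anyone with the link.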


In [15]:
df_google.shape

(891, 12)

In [16]:
df_google.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [17]:
df_google.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [18]:
df_google.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [19]:
df_google.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C


In [20]:
# Transform data: cast the categorical columns to the pandas category dtype
categorical_cols = ["Name", "Survived", "Pclass", "Sex", "Cabin", "Ticket", "Embarked"]
df_google = df_google.astype({col: "category" for col in categorical_cols})
df_google.dtypes

PassengerId       int64
Survived       category
Pclass         category
Name           category
Sex            category
Age             float64
SibSp             int64
Parch             int64
Ticket         category
Fare            float64
Cabin          category
Embarked       category
dtype: object

In [21]:
df_google.Survived.unique()

[0, 1]
Categories (2, int64): [0, 1]

In [22]:
df_google.Pclass.unique()

[3, 1, 2]
Categories (3, int64): [3, 1, 2]

In [23]:
df_google.Name.unique()

[Braund, Mr. Owen Harris, Cumings, Mrs. John Bradley (Florence Briggs Thayer), Heikkinen, Miss. Laina, Futrelle, Mrs. Jacques Heath (Lily May Peel), Allen, Mr. William Henry, ..., Montvila, Rev. Juozas, Graham, Miss. Margaret Edith, Johnston, Miss. Catherine Helen "Carrie", Behr, Mr. Karl Howell, Dooley, Mr. Patrick]
Length: 891
Categories (891, object): [Braund, Mr. Owen Harris, Cumings, Mrs. John Bradley (Florence Briggs Thayer), Heikkinen, Miss. Laina, Futrelle, Mrs. Jacques Heath (Lily May Peel), ..., Graham, Miss. Margaret Edith, Johnston, Miss. Catherine Helen "Carrie", Behr, Mr. Karl Howell, Dooley, Mr. Patrick]

In [24]:
df_google.Sex.unique()

[male, female]
Categories (2, object): [male, female]

In [25]:
df_google.Ticket.unique()

[A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, ..., SOTON/OQ 392076, 211536, 112053, 111369, 370376]
Length: 681
Categories (681, object): [A/5 21171, PC 17599, STON/O2. 3101282, 113803, ..., 211536, 112053, 111369, 370376]

In [26]:
df_google.Cabin.unique()

[NaN, C85, C123, E46, G6, ..., E17, A24, C50, B42, C148]
Length: 148
Categories (147, object): [C85, C123, E46, G6, ..., A24, C50, B42, C148]

In [27]:
df_google.Embarked.unique()

[S, C, Q, NaN]
Categories (3, object): [S, C, Q]
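
Rather than calling `unique()` once per column as above, the category-typed columns can be handled in a single pass. The sketch below uses a hypothetical three-row stand-in, `df_demo`, instead of the real `df_google`:

```python
import pandas as pd

# Hypothetical three-row stand-in for df_google, for illustration only.
df_demo = pd.DataFrame(
    {
        "Sex": ["male", "female", "female"],
        "Embarked": ["S", "C", None],
        "Age": [22.0, 38.0, 26.0],
    }
).astype({"Sex": "category", "Embarked": "category"})

# Collect the unique, non-missing values of every category-typed column.
unique_values = {
    col: df_demo[col].dropna().unique().tolist()
    for col in df_demo.select_dtypes(include="category").columns
}
print(unique_values)
```

For high-cardinality columns such as Name or Ticket, printing `df_google[col].nunique()` instead may be more readable.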

In a new Python module, acquire.py, write the following functions:

1. get_titanic_data: returns the titanic data from the Codeup data science database as a pandas data frame.

2. get_iris_data: returns the data from the iris_db on the Codeup data science database as a pandas data frame. The returned data frame should include the actual name of the species in addition to the species_ids.
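
A rough sketch of what such a module could look like follows. It is not the notebook's actual acquire.py: the table names (`passengers`, `measurements`, `species`) are assumptions, and each function here takes a connection argument `con`, whereas the notebook's versions take no arguments and presumably build the connection URL from `env`. An in-memory SQLite database stands in for the MySQL server so the example can run locally.

```python
import sqlite3

import pandas as pd

# Assumed table names; adjust them to match the actual database schema.
TITANIC_QUERY = "SELECT * FROM passengers"
IRIS_QUERY = """
SELECT *
FROM measurements
JOIN species USING (species_id)
"""


def get_titanic_data(con):
    """Return the titanic passengers as a DataFrame.

    `con` is anything pd.read_sql accepts, such as a connection URL string.
    """
    return pd.read_sql(TITANIC_QUERY, con)


def get_iris_data(con):
    """Return the iris measurements joined to their species names."""
    return pd.read_sql(IRIS_QUERY, con)


# Local stand-in database with made-up rows, for demonstration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE species (species_id INTEGER, species_name TEXT)")
con.execute(
    "CREATE TABLE measurements "
    "(measurement_id INTEGER, sepal_length REAL, species_id INTEGER)"
)
con.execute("CREATE TABLE passengers (passenger_id INTEGER, survived INTEGER)")
con.execute("INSERT INTO species VALUES (1, 'setosa')")
con.execute("INSERT INTO measurements VALUES (1, 5.1, 1)")
con.execute("INSERT INTO passengers VALUES (0, 0)")

iris = get_iris_data(con)
titanic = get_titanic_data(con)
print(iris)
print(titanic)
```

Joining with `USING (species_id)` keeps a single `species_id` column alongside `species_name`, which is what the exercise asks for.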

In [28]:
acquire.get_titanic_data().head()
    

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [31]:
acquire.get_iris_data().head()

Unnamed: 0,species_id,measurement_id,sepal_length,sepal_width,petal_length,petal_width,species_name
0,1,1,5.1,3.5,1.4,0.2,setosa
1,1,2,4.9,3.0,1.4,0.2,setosa
2,1,3,4.7,3.2,1.3,0.2,setosa
3,1,4,4.6,3.1,1.5,0.2,setosa
4,1,5,5.0,3.6,1.4,0.2,setosa
