Throughout the exercises for the Regression in Python lessons, you will use the following example scenario:

As a customer analyst, I want to know who has spent the most money with us over their lifetime. I have monthly charges and tenure, so I think I will be able to use those two attributes as features to estimate total_charges. I need to do this within an average of $5.00 per customer.

The first step will be to acquire and prep the data. Do your work for this exercise in a file named wrangle.py.

Acquire the data using SQL

In [1]:
import warnings 
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from env import user, host, password

def get_db_url(db):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
    
query='''
SELECT customer_id, monthly_charges, tenure, total_charges 
FROM customers 
WHERE contract_type_id = 3;
'''
df = pd.read_sql(query, get_db_url('telco_churn'))

Sample and summarize the data

In [2]:
df.head()

   customer_id  monthly_charges  tenure  total_charges
0   0013-SMEOE            109.7      71        7904.25
1   0014-BMAQU            84.65      63         5377.8
2   0016-QLJIS            90.45      65         5957.9
3   0017-DINOC             45.2      54        2460.55
4   0017-IUDMW            116.8      72        8456.75
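
The scenario's target of staying within an average of $5.00 per customer can be read as a mean absolute error. The following is a minimal sketch of that measurement on the five rows shown by df.head() above. It assumes a naive baseline, total_charges ≈ monthly_charges × tenure, which is an illustration and not a method taken from the lesson.

```python
# Rows copied from the df.head() output above:
# (monthly_charges, tenure, total_charges)
rows = [
    (109.70, 71, 7904.25),
    (84.65, 63, 5377.80),
    (90.45, 65, 5957.90),
    (45.20, 54, 2460.55),
    (116.80, 72, 8456.75),
]

# Assumption: naive baseline estimate, monthly_charges * tenure.
abs_errors = [abs(total - monthly * tenure) for monthly, tenure, total in rows]

# Mean absolute error: the average dollar miss per customer.
mae = sum(abs_errors) / len(abs_errors)
print(f"MAE on these five rows: ${mae:.2f}")
```

On these five rows the naive baseline misses by much more than $5.00 on average, which suggests why a fitted regression model is worth trying.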


In [3]:
df.shape

(1695, 4)

In [4]:
df.describe()

       monthly_charges     tenure
count           1695.0     1695.0
mean         60.770413  56.735103
std          34.678865  18.209363
min               18.4        0.0
25%             24.025       48.0
50%              64.35       64.0
75%              90.45       71.0
max             118.75       72.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1695 non-null object
monthly_charges    1695 non-null float64
tenure             1695 non-null int64
total_charges      1695 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 53.0+ KB


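value_counts() surfaces the odd values in the data. The most common total_charges value is a blank (whitespace-only) string, which appears 10 times. df.info() counts these blanks as non-null, which is why total_charges is still an object column, so they need to be fixed.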

In [6]:
df.total_charges.value_counts(sort=True).head()

           10
1161.75     2
7334.05     2
3533.6      2
844.45      2
Name: total_charges, dtype: int64
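
Blank strings can also be counted directly instead of reading them off value_counts(). This is a small sketch on a toy Series; the values are illustrative stand-ins for df.total_charges, not rows from the database.

```python
import pandas as pd

# Toy stand-in for df.total_charges (illustrative values only).
total_charges = pd.Series(["7904.25", " ", "5377.8", "", "  "])

# Strip surrounding whitespace, then flag the entries that end up empty.
is_blank = total_charges.str.strip().eq("")
n_blank = int(is_blank.sum())
print(n_blank)
```

The same expression applied to df.total_charges should report the number of blank entries, provided the column still holds strings.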

#### 1. Acquire customer_id, monthly_charges, tenure, and total_charges from the telco_churn database for all customers with a two-year contract.

I did this step in the SQL query, but the same column selection and filter can be done in pandas after acquiring the whole table.
Example of how to do the filtering in pandas:

In [None]:
# Acquire every column with a plain query, then select and filter in pandas.
df_telco = pd.read_sql('SELECT * FROM customers', get_db_url('telco_churn'))

my_columns = df_telco[['customer_id', 'monthly_charges', 'tenure', 'total_charges', 'contract_type_id']]
telco = my_columns[my_columns.contract_type_id == 3]
telco.head()

#### 2. Walk through the steps above using your new dataframe. You may handle the missing values however you feel is appropriate.

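Use a regex to replace the blank values with np.nan:
- ^ start of the string
- \s a whitespace character
- * zero or more of the preceding token (here, \s)
- $ end of the string

Together, ^\s*$ matches strings that are empty or contain only whitespace. Then check how many null values each column now has.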

In [7]:
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
df.isnull().sum()

customer_id         0
monthly_charges     0
tenure              0
total_charges      10
dtype: int64
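
To see which strings the pattern used above treats as blank, it can be tried on a few sample values with Python's re module. The sample strings are made up for illustration.

```python
import re

# Same pattern that was passed to df.replace(..., regex=True) above.
blank = re.compile(r"^\s*$")

# Illustrative samples: empty, whitespace-only, and real values.
samples = ["", " ", "   ", "\t", "7904.25", " 84.65 "]
matches = [bool(blank.match(s)) for s in samples]

for sample, matched in zip(samples, matches):
    print(repr(sample), matched)
```

Only the empty and whitespace-only strings match. A value with surrounding spaces, such as ' 84.65 ', is left alone because \s* cannot cover the digits.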

List the columns that contain null values. Only total_charges does, and it is still stored as object dtype, so it needs to be converted to float to match the other numeric columns.

In [9]:
df.columns[df.isnull().any()]

Index(['total_charges'], dtype='object')

Use .astype() to convert the total_charges column to float, then confirm the change with df.info().

In [10]:
df['total_charges'] = df['total_charges'].astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1695 non-null object
monthly_charges    1695 non-null float64
tenure             1695 non-null int64
total_charges      1685 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 53.0+ KB
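
As an alternative to replacing blanks and then calling .astype(float), pd.to_numeric with errors="coerce" can do both in one step. This is a sketch on toy values, not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for the raw total_charges column (illustrative values only).
raw = pd.Series(["7904.25", " ", "5377.8", ""])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
```

One caveat: coercion also silently converts any other unparseable text to NaN, so it is worth counting the nulls before and after to make sure only the expected blanks were affected.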


Drop the rows with missing values.
Alternatively, fill the missing values with a chosen value using .fillna() instead of dropping them.

In [11]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1685 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1685 non-null object
monthly_charges    1685 non-null float64
tenure             1685 non-null int64
total_charges      1685 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 65.8+ KB
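
The two options can be compared side by side on a toy frame. The rows below are illustrative, and filling with 0.0 assumes, hypothetically, that a missing total_charges belongs to a customer who has not been billed yet (tenure of 0). That assumption should be checked against the real data before relying on it.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing total_charges value (illustrative rows only).
toy = pd.DataFrame({
    "monthly_charges": [109.7, 25.75, 84.65],
    "tenure": [71, 0, 63],
    "total_charges": [7904.25, np.nan, 5377.8],
})

# Option 1: drop every row that has a missing value.
dropped = toy.dropna()

# Option 2: keep the row and fill the gap (assumed fill value: 0.0).
filled = toy.fillna({"total_charges": 0.0})

print(len(dropped), len(filled))
```

Dropping loses the row entirely, while filling keeps it at the cost of an assumed value.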


In [12]:
df.dtypes

customer_id         object
monthly_charges    float64
tenure               int64
total_charges      float64
dtype: object

#### 3. End with a python file wrangle.py that contains the function, wrangle_telco(), that will acquire the data and return a dataframe cleaned with no missing values.

In [13]:
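def wrangle_telco():
    '''
    Acquire two-year-contract customers from the telco_churn database and
    return a dataframe with total_charges as float and no missing values.
    '''
    def get_db_url(db):
        # user, password, and host are imported from env.py
        return f'mysql+pymysql://{user}:{password}@{host}/{db}'

    query = '''
    SELECT customer_id, monthly_charges, tenure, total_charges
    FROM customers
    WHERE contract_type_id = 3;
    '''
    df = pd.read_sql(query, get_db_url('telco_churn'))

    # Turn empty or whitespace-only strings into NaN
    df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
    # total_charges arrives as object dtype; convert it to float
    df['total_charges'] = df['total_charges'].astype(float)
    # Drop the rows that are missing total_charges
    df = df.dropna()
    return df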

The same steps split into smaller functions. In the third function, clean_my_data(), do not combine inplace=True with assignment: df.replace(..., inplace=True) returns None, so df = df.replace(..., inplace=True) would leave df as None. Note that this version also drops customer_id.

In [None]:
def get_db_url(db):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

def get_data_from_mysql():
    query = '''
    SELECT customer_id, monthly_charges, tenure, total_charges
    FROM customers
    WHERE contract_type_id = 3;
    '''
    df = pd.read_sql(query, get_db_url('telco_churn'))
    return df

def clean_my_data(df):
    # replace() returns a new dataframe here because inplace is not set
    df = df.replace(r'^\s*$', np.nan, regex=True)
    df['total_charges'] = df['total_charges'].astype(float)
    df = df.dropna()
    df = df.drop(columns=['customer_id'])
    return df

def wrangle_telco():
    df = get_data_from_mysql()
    df = clean_my_data(df)
    return df