# Example Scenario / Goal
## As a customer analyst, I want to know who has spent the most money with us over their lifetime. I have monthly charges and tenure, so I think I can use those two attributes as features to estimate total_charges. I need the estimates to be within an average error of $5.00 per customer.
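The $5.00 goal needs a concrete error metric. The sketch below assumes the target is measured as root mean squared error (RMSE) in dollars, which matches the `mean_squared_error` and `sqrt` imports used later but is not stated explicitly in the goal. The numbers are made up for illustration, and `TARGET_ERROR` is a hypothetical name.

```python
from math import sqrt

# Hypothetical actual and predicted total_charges for four customers.
actual = [100.0, 250.0, 400.0, 1000.0]
predicted = [103.0, 246.0, 400.0, 1005.0]

# RMSE: square root of the mean squared difference, in dollars.
# Equivalent to sqrt(mean_squared_error(actual, predicted)) from scikit-learn.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = sqrt(sum(squared_errors) / len(squared_errors))

TARGET_ERROR = 5.00  # dollars per customer, taken from the goal statement
meets_goal = rmse <= TARGET_ERROR
print(f"RMSE: ${rmse:.2f} (goal met: {meets_goal})")
```

If the goal is instead read as a mean absolute error, the same comparison applies with the average of `abs(a - p)`.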

## Set up environment
### Import libraries and packages that will be used throughout the project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
import statsmodels.api as sm
import wrangle
import split_scale
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from statsmodels.formula.api import ols
from sklearn.linear_model import LassoCV
import warnings
from sklearn.feature_selection import RFE
warnings.filterwarnings("ignore")

## Acquire & Prep
### The first step is to acquire and prep the data. The code for this step lives in wrangle.py, which was imported above.

In [2]:
# Run the wrangle_telco function from the wrangle.py script.
# This function returns the columns customer_id, tenure, monthly_charges, and total_charges.
# It also cleans the data by removing rows with null values and converting total_charges to type float.
data = wrangle.wrangle_telco()
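
The body of `wrangle_telco` is not shown here. As a rough sketch of the cleaning it is described as doing, the example below works on a tiny made-up DataFrame. The function name `clean_telco`, the sample rows, and the assumption that missing totals arrive as blank strings are all illustrative and are not taken from wrangle.py.

```python
import pandas as pd

# Hypothetical raw rows; total_charges arrives as text and one value is blank.
raw = pd.DataFrame({
    "customer_id": ["A-1", "B-2", "C-3"],
    "tenure": [71, 0, 54],
    "monthly_charges": [109.70, 20.00, 45.20],
    "total_charges": ["7904.25", " ", "2460.55"],
})


def clean_telco(df):
    """Sketch of a cleaning step; not the actual wrangle_telco code."""
    df = df.copy()
    # Non-numeric text such as " " becomes NaN instead of raising an error.
    df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")
    # Drop rows with a null in any column, then renumber the index.
    return df.dropna().reset_index(drop=True)


clean = clean_telco(raw)
print(clean.dtypes)
print(clean)
```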

In [3]:
data.head()

,customer_id,tenure,monthly_charges,total_charges
0,0013-SMEOE,71,109.7,7904.25
1,0014-BMAQU,63,84.65,5377.8
2,0016-QLJIS,65,90.45,5957.9
3,0017-DINOC,54,45.2,2460.55
4,0017-IUDMW,72,116.8,8456.75


### Now we separate the data into the target variable and the independent variables. The target is total charges, and the independent variables are monthly charges and tenure. We will use tenure and monthly charges to build a model that predicts total charges as accurately as possible.

In [4]:
X = data.drop(columns='total_charges').set_index('customer_id')
y = pd.DataFrame(data.total_charges).set_index(data['customer_id'])

In [5]:
X.head()

customer_id,tenure,monthly_charges
0013-SMEOE,71,109.7
0014-BMAQU,63,84.65
0016-QLJIS,65,90.45
0017-DINOC,54,45.2
0017-IUDMW,72,116.8


In [6]:
y.head()

customer_id,total_charges
0013-SMEOE,7904.25
0014-BMAQU,5377.8
0016-QLJIS,5957.9
0017-DINOC,2460.55
0017-IUDMW,8456.75


## Split Data
### Here we split the data into train and test sets: 80% is used to train the models and the remaining 20% is used to test them. The split_scale script contains the function split_my_data, which performs the split, along with functions that scale the feature data (X) in a variety of ways. Scaling puts the features on a comparable range so that no variable dominates simply because of its units or magnitude. Left unscaled, features with larger values can have a disproportionate effect on some models.
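
The scaling functions in split_scale are not shown, so the following is only a sketch of one common option, standardization with scikit-learn's `StandardScaler`. The sample rows reuse values from the table above. The key point it illustrates is that the scaler is fit on the training rows only and then applied unchanged to the test rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative frames; in the notebook these would come from the split.
X_train = pd.DataFrame(
    {"tenure": [71, 63, 65, 54], "monthly_charges": [109.70, 84.65, 90.45, 45.20]},
    index=["0013-SMEOE", "0014-BMAQU", "0016-QLJIS", "0017-DINOC"],
)
X_test = pd.DataFrame(
    {"tenure": [72], "monthly_charges": [116.80]},
    index=["0017-IUDMW"],
)

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# transform returns a NumPy array, so rebuild DataFrames to keep labels.
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled.round(2))
print(X_test_scaled.round(2))
```

Fitting on the training set alone keeps information about the test rows out of the model inputs.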

In [7]:
X_train, X_test = split_scale.split_my_data(X)

In [8]:
y_train, y_test = split_scale.split_my_data(y)
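
Splitting X and y in two separate calls keeps their rows aligned only if split_my_data selects the same rows each time, for example by using a fixed random seed. Its implementation is not shown, so that is an assumption worth checking. As a sketch of an alternative, scikit-learn's `train_test_split` can split both frames in a single call, which guarantees matching rows. The data, the 80/20 ratio passed as `train_size`, and the seed value below are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the notebook's X and y, indexed by customer id.
ids = [f"cust-{i}" for i in range(10)]
X = pd.DataFrame(
    {
        "tenure": list(range(1, 11)),
        "monthly_charges": [20.0 + 5 * i for i in range(10)],
    },
    index=ids,
)
y = pd.DataFrame({"total_charges": X["tenure"] * X["monthly_charges"]})

# One call splits both frames with the same shuffled row order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=123
)

# Sanity check: features and target refer to the same customers.
print(X_train.index.equals(y_train.index))
print(X_test.index.equals(y_test.index))
```

The same index comparison can be run on the output of the two split_my_data calls above to confirm that their rows match.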