# Project Outline 
You will perform data analysis for Bellabeat, a high-tech manufacturer of health-focused products for women. You will analyze smart device data to gain insight into how consumers are using their smart devices. Your analysis will help guide future marketing strategies for your team. Along the way, you will perform numerous real-world tasks of a junior data analyst by following the steps of the data analysis process.

To do this, I will perform an **exploratory data analysis** on data collected from 30 users 5 years ago.

## Data Visualisation

For this project, the focus will be on visualising the data: exploring it and learning something. Hopefully this will encourage other Kagglers to think of interesting questions, explore the data, and share their results.

I'll be making extensive use of **matplotlib & seaborn**.

In [27]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

# Data Engineering

Before I can analyze the data, I first need to do some data engineering. 

I will first create data frames and join the two sets together, doing some feature engineering along the way to make data visualization and machine learning easier.

In [28]:
dailyActivity = pd.read_csv('../input/bellabeat-casestudy/dailyActivity_merged.csv')
sleepDay = pd.read_csv('../input/bellabeat-casestudy/sleepDay_merged.csv')

Create a new dataframe containing the relevant information and the average of each user's inputs

In [29]:
df1 = dailyActivity.drop(['LoggedActivitiesDistance', 'SedentaryActiveDistance'], axis = 1)
df1 = df1.groupby(['Id']).mean()

In [30]:
df_sleep = sleepDay.drop(['SleepDay', 'TotalSleepRecords'], axis = 1)
df_sleep = df_sleep.groupby(['Id']).mean()
df_sleep

Add the sleep averages to the activity data frame, then create a new data frame dropping all users with NaN or missing values

In [31]:
df2 = df1.copy()  # copy so that adding the sleep columns does not also modify df1
df2["timeAsleep"] = df_sleep['TotalMinutesAsleep']
df2["timeInBed"] = df_sleep['TotalTimeInBed']
df3 = df2.dropna()
df3
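
The column assignment above relies on pandas aligning rows on the `Id` index: a user with no sleep records gets `NaN` in the new columns, and `dropna()` then removes that user. A minimal sketch of this behaviour, using invented toy values (the IDs and numbers below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical per-user averages; the values are invented for illustration.
activity = pd.DataFrame(
    {"Calories": [2000.0, 2500.0, 1800.0]},
    index=pd.Index([1, 2, 3], name="Id"),
)
sleep = pd.DataFrame(
    {"TotalMinutesAsleep": [420.0, 390.0]},
    index=pd.Index([1, 3], name="Id"),
)

# Work on a copy so the original activity frame is left untouched.
combined = activity.copy()
# Assignment aligns on the Id index; user 2 has no sleep data and gets NaN.
combined["timeAsleep"] = sleep["TotalMinutesAsleep"]

# Dropping rows with missing values removes user 2.
complete = combined.dropna()
print(complete)
```

This is why users without sleep data drop out of the analysis at this step.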


Create a new dataframe after identifying and removing outliers

In [32]:
import scipy.stats as stats
z_scores = stats.zscore(df3)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
df4 = df3[filtered_entries]

df4
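
The filter above keeps a user only if every one of their features lies within 3 standard deviations of that feature's mean. As a rough sketch of the same idea in plain NumPy, on randomly generated data with one artificial outlier (the data and variable names are assumptions for illustration only):

```python
import numpy as np

# Hypothetical data: 20 users, 2 features, with one extreme value injected.
rng = np.random.default_rng(0)
data = rng.normal(loc=100.0, scale=10.0, size=(20, 2))
data[0, 1] = 500.0  # artificial outlier

# Column-wise z-scores: (value - column mean) / column standard deviation.
# ddof=0 matches the default of scipy.stats.zscore.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)

# Keep a row only if every feature lies within 3 standard deviations.
keep = (np.abs(z) < 3).all(axis=1)
filtered = data[keep]

print(data.shape, "->", filtered.shape)
```

Because `.all(axis=1)` is used, a single extreme feature is enough to remove the whole user from the data.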

# Statistical Overview

In [33]:
df4.describe()

### What does this show us?

We see that data for 21 of the 30 users remains. 

However, we must continue our analysis with caution, as our data does not take into account a user's gender or fitness level; it therefore cannot distinguish between different kinds of users. Our key takeaways here are:
* The median for calories burned is 2330
* The max calories burned per day is 3430 
* The minimum calories burned per day is 1540 

# Data Visualization

Let's begin by identifying correlations between the other features and our target value, Calories.

In [35]:
df4.corr()['Calories'].sort_values()

#### What do we see?

The above correlation values suggest the following:

* Very active minutes is our best indicator of calories
* The higher the intensity of the activity measured, the better it indicates caloric output
* Sleep and inactivity are our worst indicators 

In [38]:
sns.regplot(x="VeryActiveDistance", y="Calories", data = df4)

In [24]:
sns.regplot(x="LightActiveDistance", y="Calories", data = df4)

In [25]:
sns.pairplot(df4)
plt.show()  # seaborn no longer exposes sns.plt; use matplotlib.pyplot directly

The scatter plot matrix above reveals positive correlations between the following:

* Activity & Calories
* Sleep & Calories
* Time asleep & Time in bed


In [39]:
df_activity = df4[['Calories', 'VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes','SedentaryMinutes']]
sns.pairplot(df_activity)
plt.show()  # seaborn no longer exposes sns.plt; use matplotlib.pyplot directly

### So what do we see?

We gain insight into how the intensity of an activity and the time spent on it relate to the number of calories burned:
* Those that work at higher intensities burn more calories in less time
* An equivalent number of calories can be attained at lower intensities for more time


#### Now let's look at the distribution of calories burned

In [40]:
sns.kdeplot(df4['Calories'])



#### What does this show us?

We see here that most users burn around 2,330 calories a day 

#### Model Development using Linear Regression

In [41]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()

X = df4[['TotalDistance']]
Y = df4[['Calories']]
lm.fit(X,Y)
Yhat = lm.predict(X)

lm.score(X,Y)

In [42]:
lm1 = LinearRegression()
X1 = df4[['VeryActiveMinutes']]
Y = df4[['Calories']]
lm1.fit(X1,Y)
Yhat = lm1.predict(X1)

lm1.score(X1,Y)
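
`lm.score(X, Y)` returns the coefficient of determination, R², computed here on the same data the model was fitted to. As a rough sketch of what that number measures, the following fits a one-feature line with NumPy on invented values (not the notebook's data) and computes R² by hand:

```python
import numpy as np

# Hypothetical values for illustration only: active minutes vs. calories.
x = np.array([5.0, 10.0, 20.0, 30.0, 45.0, 60.0])
y = np.array([1900.0, 2000.0, 2300.0, 2350.0, 2700.0, 3000.0])

# Ordinary least squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

For a single feature with an intercept, R² equals the square of the Pearson correlation, so each score carries the same information as the corresponding entry of `df4.corr()['Calories']`. With only about 21 users and no held-out data, these in-sample scores are likely optimistic.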

# Findings



Based on the correlations and regression scores above, it appears that we can reject the null hypothesis that activity has no impact on caloric output 

The company can use this data to improve their app and Fitbit tracker by implementing programs that reward high-intensity workouts. A feature could be added to the app that allows users to see their predicted caloric output based on the intensity and duration of a planned workout. Additionally, the app could create programs that incentivise hitting a certain number of steps per day. With these programs, customers who need to track their daily caloric output will be drawn to our product, increasing sales.