## 1. Introduction and defining the business task

### About the company

Bellabeat is a high-tech company that manufactures health-focused smart products.
The co-founder, Urška Sršen, used her background as an artist to develop beautifully designed technology that informs and inspires women around the world.
Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with
knowledge about their own health and habits. 

The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter.



### Business task

The founders know that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth, so they asked us to analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, they would like high-level recommendations for how these trends can inform Bellabeat’s marketing strategy.

In other words, we want to analyze data from smart fitness devices to uncover potential new growth opportunities for the company, which would in turn help drive its marketing strategy.



### Key Stakeholders

- Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
- Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy

## 2. Preparing the data

### About the dataset 

The data is publicly available on Kaggle and is stored in 18 CSV files. It explores the daily habits of smart device users and contains personal fitness tracker data from thirty users who consented to its submission, including minute-level output for physical activity, heart rate, and sleep monitoring. It was generated by respondents to a distributed survey via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016.


### Data limitations

- Small sample size: The sample size is small relative to the total population of women or the number of women who own or use Bellabeat products.

- No metadata or demographics: Data on the age, race, and socioeconomic status (e.g. marital status) of respondents, as well as data such as location, lifestyle, weather, temperature, and humidity, was not provided. These variables may have an impact on activity and on the use of Bellabeat products, and could provide a deeper understanding of the results.

- Data collection period: The data was collected in 2016, so results from the analysis may no longer be applicable. In addition, because it covers only 31 days, the impact of seasonality may not be accounted for, nor other long-term changes women may face, such as pregnancy.

- Data integrity: Given that the data was obtained from a third-party source, its integrity may be compromised.



## 3. Processing the data

### Importing the packages that will be used

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Reading the files, organizing and merging them

In [None]:
#list of unique dataframes: [df1, df2, result]

#reading the first dataset
df1 = pd.read_csv('dailyActivity_merged.csv') #shape: (940, 15)

#reading the 2nd dataset
df2 = pd.read_csv('heartrate_seconds_merged.csv') #shape: (2483658, 3)

#reading the 3rd dataset
df3 = pd.read_csv('hourlyCalories_merged.csv') #shape: (22099, 3)

#reading the 4th dataset
df4 = pd.read_csv('hourlyIntensities_merged.csv') #shape: (22099, 4)

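#merging df3 and df4 on their shared key columns as result

result = pd.merge(df3, df4, on=['Id', 'ActivityHour'], how='left')

#reading the 5th dataset
df5 = pd.read_csv('hourlySteps_merged.csv') #shape: (22099, 3)

#merging df5 with the merged df3 and df4 as result
result = pd.merge(result, df5, on=['Id', 'ActivityHour'], how='left') #shape: (22099, 6)
#per-user totals; numeric_only=True skips the ActivityHour text column
resultgroubed = result.groupby('Id').sum(numeric_only=True)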

#reading the 6th dataset (dataset number 7 holds the same data in wide format)
df6 = pd.read_csv('minuteCaloriesNarrow_merged.csv')
#per-user totals; numeric_only=True skips the timestamp text column
df6groubed = df6.groupby('Id').sum(numeric_only=True)

#reading the 8th dataset (dataset number 9 holds the same data in wide format)
df8 = pd.read_csv('minuteIntensitiesNarrow_merged.csv')
df8groubed = df8.groupby('Id').sum(numeric_only=True)

#reading the 11th dataset
df11 = pd.read_csv('minuteSleep_merged.csv')

#reading the 12th dataset
df12 = pd.read_csv('minuteStepsNarrow_merged.csv')

#reading the 13th dataset
df13 = pd.read_csv('sleepDay_merged.csv')

#reading the 14th dataset
df14 = pd.read_csv('weightLogInfo_merged.csv')
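
The merges above rely on the hourly files sharing one row per user and hour. A minimal sketch of how that assumption could be checked is shown here on small made-up tables (the values and the names `calories`, `steps` and `hourly` are hypothetical, not taken from the dataset). Passing `validate="one_to_one"` makes `pd.merge` raise an error if a key appears more than once on either side.

```python
import pandas as pd

# Hypothetical toy tables standing in for two of the hourly CSV files.
hours = ["4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/2016 12:00:00 AM"]
calories = pd.DataFrame({"Id": [1, 1, 2], "ActivityHour": hours, "Calories": [81, 61, 70]})
steps = pd.DataFrame({"Id": [1, 1, 2], "ActivityHour": hours, "StepTotal": [373, 160, 0]})

# Merge on explicit keys; validate raises MergeError if keys are duplicated.
keys = ["Id", "ActivityHour"]
hourly = pd.merge(calories, steps, on=keys, how="left", validate="one_to_one")

# A left merge keeps every row of the left table; unmatched rows would show NaN.
print(hourly.shape)
print(hourly["StepTotal"].isna().sum())
```

If the row count after the merge differs from the left table, or missing values appear in the new columns, the key columns probably do not line up between the files.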


### Summary about the dataframes 

#### We will be working on these 9 dataframes:
- df1: the merged daily activity
- df2: the heart rate recorded every five seconds
- result: the calories, total intensity and step total for each hour
- df6: the calories for each minute
- df8: the intensity for each minute
- df11: the amount of sleep for each minute
- df12: the number of steps for each minute
- df13: the amount of sleep and time in bed for each day
- df14: the weight and BMI measured every day


In [None]:
dflist = [df1, df2, result, df6, df8, df11, df12, df13]

### Cleaning the data

In [None]:
#drop the null values and the duplicate rows (if there are any) in every dataframe in dflist; df14 is handled separately below:
for df in dflist:
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)

#since most values in the Fat column of df14 are missing, we drop that column:
df14.drop('Fat', axis=1, inplace=True)

#viewing the number of rows and columns of each dataframe:
for df in dflist:
    print(df.shape)
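
Before dropping rows, it can help to see how many rows are actually affected. The sketch below is one possible way to do that; `cleaning_report` is a hypothetical helper written for this illustration, and the `toy` dataframe is made-up data rather than one of the files above.

```python
import pandas as pd

def cleaning_report(frames):
    """Count rows, rows containing nulls, and duplicate rows for each dataframe."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "name": name,
            "rows": len(df),
            "rows_with_nulls": int(df.isna().any(axis=1).sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("name")

# Hypothetical toy data: one exact duplicate row and one row with a missing value.
toy = pd.DataFrame({"Id": [1, 1, 2, 3], "Steps": [10, 10, None, 5]})
report = cleaning_report({"toy": toy})
print(report)
```

The same function could be called with a dictionary of the real dataframes, for example `{"df1": df1, "df2": df2}`, before the cleaning loop runs.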

## 4. Analyzing the data

In [None]:
#summing the daily activity by date and sorting by TotalSteps: the 10 days with the highest total steps and calories burned
#are in April, and 17 of the top 20 are in April. One possible explanation is that people exercise more when the weather
#is moderate rather than hot, but the dataset contains no weather data to confirm this
df1.groupby('ActivityDate').sum().sort_values(by='TotalSteps', ascending=False)

#we can also see the correlation between the numeric variables in this dataset:
df1.corr(numeric_only=True)
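
A daily sum depends on how many users logged data on that day, so a day can rank high simply because more people wore their tracker. A possible cross-check, sketched here on made-up numbers (the `daily` dataframe and the `%m/%d/%Y` date format are assumptions for illustration), is to look at the number of users and the mean per user next to the total.

```python
import pandas as pd

# Hypothetical toy data: three users log on the first date, only two on the second.
daily = pd.DataFrame({
    "Id": [1, 2, 3, 1, 2],
    "ActivityDate": ["4/12/2016", "4/12/2016", "4/12/2016", "5/12/2016", "5/12/2016"],
    "TotalSteps": [4000, 5000, 6000, 7000, 8000],
})

# Parse the text dates so that sorting and grouping follow calendar order.
daily["ActivityDate"] = pd.to_datetime(daily["ActivityDate"], format="%m/%d/%Y")

# Total, number of reporting users, and mean per user for each date.
per_day = daily.groupby("ActivityDate")["TotalSteps"].agg(
    total="sum", users="count", mean_per_user="mean"
)
print(per_day)
```

In this toy example both dates have the same total, yet the mean per user is higher on the date with fewer users, which is the kind of difference the plain sum hides.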

In [None]:
#mean heart rate for each user, followed by the highest and the lowest of those per-user means
#(numeric_only=True skips the timestamp text column):
df2.groupby('Id').mean(numeric_only=True)
df2.groupby('Id').mean(numeric_only=True).max()
df2.groupby('Id').mean(numeric_only=True).min()
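
If the goal is the mean, minimum and maximum heart rate of each user, a single `agg` call can return all three at once. This is a minimal sketch on made-up readings; the column name `Value` is an assumption about the heart rate file.

```python
import pandas as pd

# Hypothetical toy heart rate readings for two users.
heart = pd.DataFrame({"Id": [1, 1, 1, 2, 2], "Value": [60, 70, 80, 90, 110]})

# One row per user with the mean, minimum and maximum reading.
summary = heart.groupby("Id")["Value"].agg(["mean", "min", "max"])
print(summary)
```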

In [None]:
#total calories per day: the days with the most calories burned are in April
result_copy = result.copy()
#keep only the date part of ActivityHour (the text before the first space)
result_copy['ActivityHour'] = result_copy['ActivityHour'].str.split(' ', n=1).str[0]
result_copy.groupby('ActivityHour').sum().sort_values(by='Calories', ascending=False)

#total calories per time of day: the most calories are burned in the evening hours
result_copy2 = result.copy()
#keep only the time part of ActivityHour (the text after the first space)
result_copy2['ActivityHour'] = result_copy2['ActivityHour'].str.split(' ', n=1).str[1]
result_copy2.groupby('ActivityHour').sum().sort_values(by='Calories', ascending=False)

#this scatter plot shows the positive relationship between total steps and calories:
result.groupby('Id').sum(numeric_only=True).plot(x='StepTotal', y='Calories', kind='scatter')

#this scatter plot shows the positive relationship between average intensity and calories:
result.groupby('Id').sum(numeric_only=True).plot(x='AverageIntensity', y='Calories', kind='scatter')
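
Slicing the timestamp text by position can split dates of different lengths inconsistently. An alternative, sketched below on made-up rows, is to parse the column with `pd.to_datetime` and group by the hour. The format string `%m/%d/%Y %I:%M:%S %p` is an assumption about how `ActivityHour` is written and should be checked against the actual file.

```python
import pandas as pd

# Hypothetical toy rows; note that the dates have different text lengths.
hourly = pd.DataFrame({
    "ActivityHour": ["4/12/2016 6:00:00 PM", "5/1/2016 6:00:00 PM", "4/12/2016 12:00:00 AM"],
    "Calories": [120, 130, 60],
})

# Parse the text into timestamps (the format string is an assumption).
ts = pd.to_datetime(hourly["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p")

# Mean calories for each hour of the day on a 0-23 scale.
by_hour = hourly.groupby(ts.dt.hour)["Calories"].mean()
print(by_hour.sort_values(ascending=False))
```

Using the mean rather than the sum keeps hours comparable even when some hours have fewer recorded rows.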



In [None]:
#in these 2 lines of code we merge the per-user aggregates of df13 (bedtime and sleep), df6 (calories), df8 (intensity),
#df11 (sleep), df12 (steps) and df2 (heart rate) so we can see the correlation between these variables
merged_1 = df13.groupby('Id').sum(numeric_only=True).merge(df6groubed, on='Id').merge(df8groubed, on='Id')
result_2 = merged_1.merge(df11.groupby('Id').sum(numeric_only=True), on='Id').merge(df12.groupby('Id').sum(numeric_only=True), on='Id').merge(df2.groupby('Id').mean(numeric_only=True), on='Id')

In [None]:
#these lines of code show the relationship between each variable and the calories burned
result_2.plot(x='TotalMinutesAsleep', y='Calories', kind='scatter')
result_2.plot(x='Intensity', y='Calories', kind='scatter')
result_2.plot(x='value', y='Calories', kind='scatter')
result_2.plot(x='Steps', y='Calories', kind='scatter')
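
A scatter plot shows the direction of a relationship, and a correlation coefficient can put a number on it. The sketch below uses made-up per-user totals (the `per_user` dataframe is hypothetical) and `Series.corr`, which computes the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical per-user totals with a perfectly linear relationship.
per_user = pd.DataFrame({
    "Steps": [1000, 2000, 3000, 4000],
    "Calories": [1500, 1700, 1900, 2100],
})

# Pearson correlation between the two columns (method="pearson" is the default).
r = per_user["Steps"].corr(per_user["Calories"])
print(round(r, 3))
```

With only about thirty users in the real data, such a coefficient is best read as a rough indication rather than a precise estimate.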