# Inspect r/loseit Challenge Data

Now we can read in the already cleaned file. If you don't have the cleaned data, you will need to run [Find and Clean Loseit Data](clean_loseit_challenge_data.ipynb).

In [19]:
import os
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

pd.options.mode.chained_assignment = None  # default='warn'
color = sns.color_palette()

%matplotlib inline

We begin by loading the dataset and looking at the number of entries.

In [20]:
big_df = pd.read_csv('./data/cleaned_and_combined_loseit_challenge_data.csv', index_col=0)
big_df['NSV Text'] = big_df['NSV Text'].astype(str).replace('nan', '')
len(big_df)

7859

Next, we look at some of the statistics for the participants, starting with age.

In [21]:
display(big_df.sort_values(by='Age').head())
display(big_df.sort_values(by='Age', ascending=False).head())

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
5993,7/15/2016 13:18:30,Spartan117g,SUNSHINE,1.0,Male,31.0,165.0,181.0,171.0,132.41,0,0,0,,10.0,178.0,3.0,1.657459,30.0
1066,6/2/2017 9:56:52,FattyTeen12,Deadpool,13.0,Unknown,58.0,125.0,90.3,90.0,19.43,1,1,1,To begin a healthy life style,0.3,88.0,2.3,2.547065,766.666667
1758,6/11/2017 13:28:15,Elessar_Inman,Batman,14.0,Unknown,69.0,143.0,137.5,130.0,20.72,1,0,0,To get ripped!,7.5,137.0,0.5,0.363636,6.666667
264,3/23/2018 19:40:15,thehealthymt,Dragon,14.0,Female,66.0,252.0,232.0,229.0,37.44,1,1,0,Fit into L hoodie,3.0,229.9,2.1,0.905172,70.0
3694,1/8/2017 14:26:41,cantthinkofanything3,Snake,14.0,Male,65.0,215.0,207.5,200.0,34.53,1,0,0,Be able to control the amount of food I eat wi...,7.5,200.4,7.1,3.421687,94.666667


Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
899,4/4/2018 13:13:16,AmIaThrowawayTotally,Yeti,100.0,Other,67.0,299.0,299.0,290.0,46.82,0,0,0,,9.0,290.0,9.0,3.010033,100.0
5323,7/11/2018 11:42:20,AmIaThrowAwayTotally,Shadowfax,100.0,Other,67.0,200.0,200.0,190.0,31.32,1,0,0,"Drink more water, take 7000+ steps per day.",10.0,196.5,3.5,1.75,35.0
1006,4/12/2018 12:33:46,patchet44,Cerberus,76.0,Female,68.0,208.0,163.0,157.0,24.78,1,0,0,better health,6.0,159.9,3.1,1.90184,51.666667
4968,7/1/2018 18:29:57,mcculloughronnie75,Shadowfax,76.0,Female,61.0,175.0,166.0,155.0,31.36,1,0,1,So I can fit into smaller and cuter clothes,11.0,162.0,4.0,2.409639,36.363636
2049,1/6/2018 18:33:51,mcculloughronnie75,Hamster,75.0,Female,61.0,175.0,171.0,159.0,32.31,1,0,0,Fit into 12 petite,12.0,168.0,3.0,1.754386,25.0


Here we can see that there are outliers at both the young and old ends of the age spectrum: one entry with an age of 1 and two with an age of 100. I will check whether these users ever entered with a different age; if not, I will remove them from any age analysis.

In [22]:
display(big_df[big_df.Username == 'AmIaThrowawayTotally'])
display(big_df[big_df.Username == 'AmIaThrowAwayTotally'])
display(big_df[big_df.Username == 'Spartan117g'])

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
899,4/4/2018 13:13:16,AmIaThrowawayTotally,Yeti,100.0,Other,67.0,299.0,299.0,290.0,46.82,0,0,0,,9.0,290.0,9.0,3.010033,100.0


Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
5323,7/11/2018 11:42:20,AmIaThrowAwayTotally,Shadowfax,100.0,Other,67.0,200.0,200.0,190.0,31.32,1,0,0,"Drink more water, take 7000+ steps per day.",10.0,196.5,3.5,1.75,35.0


Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
784,3/30/2018 18:04:27,Spartan117g,Phoenix,21.0,Male,67.0,196.0,163.6,160.0,25.62,1,0,0,Feel better,3.6,162.0,1.6,0.977995,44.444444
5993,7/15/2016 13:18:30,Spartan117g,SUNSHINE,1.0,Male,31.0,165.0,181.0,171.0,132.41,0,0,0,,10.0,178.0,3.0,1.657459,30.0
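
The two throwaway usernames above differ only in capitalization, so the three lookups could also be done in one case-insensitive pass. This is only a sketch: `entries_for` is a hypothetical helper, and the small frame below is a cut-down stand-in for `big_df` using a few usernames and ages from the output above.

```python
import pandas as pd

def entries_for(df, usernames):
    """Return all rows whose Username matches any given name, ignoring case."""
    wanted = {name.lower() for name in usernames}
    mask = df['Username'].str.lower().isin(wanted)
    return df[mask]

# Cut-down stand-in for big_df (only two columns kept).
demo = pd.DataFrame({
    'Username': ['AmIaThrowawayTotally', 'AmIaThrowAwayTotally',
                 'Spartan117g', 'Spartan117g', 'patchet44'],
    'Age': [100.0, 100.0, 21.0, 1.0, 76.0],
})

flagged = entries_for(demo, ['amiathrowawaytotally', 'Spartan117g'])
print(flagged)
```

With the real data this would be called as `entries_for(big_df, [...])`.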


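OK, so we will need to ignore the age 100 entries, but we can fix the age 1 entry. Spartan117g entered as 21 in March 2018, so the July 2016 entry should be about 19. We can also replace the height of 31 inches with the 67 inches from the later entry.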

In [23]:
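# Spartan117g: age 1 -> 19, height 31 in -> 67 in.
# Assign the result back instead of using inplace=True on a column accessor,
# which newer pandas versions may not propagate to big_df.
big_df['Age'] = big_df['Age'].replace(1.0, 19.0)
big_df['Height'] = big_df['Height'].replace(31.0, 67.0)
display(big_df[big_df.Username == 'Spartan117g'])
# The age 100 entries are only excluded from the age analysis
age_df = big_df[big_df.Age < 100]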

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
784,3/30/2018 18:04:27,Spartan117g,Phoenix,21.0,Male,67.0,196.0,163.6,160.0,25.62,1,0,0,Feel better,3.6,162.0,1.6,0.977995,44.444444
5993,7/15/2016 13:18:30,Spartan117g,SUNSHINE,19.0,Male,67.0,165.0,181.0,171.0,132.41,0,0,0,,10.0,178.0,3.0,1.657459,30.0
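
Note that the corrected row above still shows a Starting BMI of 132.41, which belongs to the old height of 31 inches. The sketch below assumes the column was computed with the usual imperial formula, 703 × weight (lbs) / height (in)², which does reproduce the stored value. `bmi_imperial` is a name made up for this example.

```python
def bmi_imperial(weight_lb, height_in):
    """BMI from pounds and inches: 703 * weight / height**2."""
    return 703.0 * weight_lb / height_in ** 2

# Starting weight of 181 lbs with the mistyped height of 31 inches:
# this matches the 132.41 stored in the data.
bmi_with_typo = bmi_imperial(181.0, 31.0)

# The same weight with the corrected height of 67 inches.
bmi_corrected = bmi_imperial(181.0, 67.0)

print(round(bmi_with_typo, 2), round(bmi_corrected, 2))
```

Under that assumption the corrected BMI comes out at roughly 28.35, which could be written back to this row in the same way as the height.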


The next component we will look at is height.

In [24]:
display(big_df.sort_values(by='Height').head())
display(big_df.sort_values(by='Height', ascending=False).head())

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
7887,4/4/2016 11:40:21,beeckahhh,THUNDERSTORM,23.0,Female,52.0,185.0,168.0,155.0,43.68,1,1,0,Wear a medium in summer dresses,13.0,162.0,6.0,3.571429,46.153846
5431,7/7/2016 15:41:45,Verivus,BLUEBERRY,26.0,Female,52.0,143.0,118.8,110.0,30.89,1,1,1,Running a 5K without a break,8.8,120.6,-1.8,-1.515152,-20.454545
5260,7/9/2018 15:31:35,mossy-pants,2nd Breakfast,21.0,Female,52.0,166.8,157.2,152.0,40.87,1,0,1,increase cardio fitness score on fitbit,5.2,152.4,4.8,3.053435,92.307692
3682,1/6/2017 19:31:50,g-rain,Snake,25.0,Female,52.0,222.0,186.0,176.0,48.36,1,1,1,Improving strength and stamina,10.0,174.0,12.0,6.451613,120.0
4662,6/29/2018 13:38:35,myfamilyworriessome,Frodo and Sam,27.0,Female,52.0,164.2,150.6,140.0,39.15,1,0,0,Fit into my little blue dress for a wedding in...,10.6,144.2,6.4,4.249668,60.377358


Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
8267,10/29/2017 12:21:10,Mercer022,Lynx,22.0,Male,189.0,216.0,216.0,200.0,4.25,1,1,0,To start jogging,16.0,209.0,7.0,3.240741,43.75
8635,11/8/2017 20:18:11,SkyDweller-Entropist,Panda,21.0,Male,82.6,216.0,216.0,200.0,22.26,0,1,0,,16.0,220.0,-4.0,-1.851852,-25.0
7201,4/6/2016 21:26:42,Sportsfreaktony,DUCKLING,25.0,Male,82.0,250.0,230.0,200.0,24.05,0,0,0,,30.0,219.0,11.0,4.782609,36.666667
3513,1/6/2017 10:46:53,nfaber06,Phoenix,28.0,Male,82.0,430.0,398.0,375.0,41.61,1,0,0,Fit in old clothes,23.0,384.0,14.0,3.517588,60.869565
430,3/25/2018 6:50:18,jeffles2,Dragon,42.0,Male,81.0,275.0,245.0,240.0,26.25,1,0,0,Stick to elimination diet,5.0,224.6,20.4,8.326531,408.0
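
The same pair of `sort_values(...).head()` calls is used for every column we inspect, so it could be wrapped in a small helper. This is a minimal sketch: `show_extremes` is a hypothetical name, and the frame below is a toy example rather than the real data.

```python
import pandas as pd

def show_extremes(df, column, n=5):
    """Return the n smallest and the n largest rows for `column`."""
    lowest = df.nsmallest(n, column)    # sorted ascending
    highest = df.nlargest(n, column)    # sorted descending
    return lowest, highest

# Toy frame, only to exercise the helper.
demo = pd.DataFrame({
    'Username': ['a', 'b', 'c', 'd'],
    'Height': [52.0, 67.0, 189.0, 82.0],
})

lowest, highest = show_extremes(demo, 'Height', n=2)
print(lowest)
print(highest)
```

In the notebook, each frame could then be passed to `display` as before.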


In [25]:
big_df[big_df.Username == 'Mercer022']

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
1430,6/4/2017 14:13:24,Mercer022,Deadpool,22.0,Unknown,74.4,260.0,229.2,220.0,29.97,1,0,0,To start jogging,9.2,224.8,4.4,1.919721,47.826087
7832,4/1/2016 13:21:19,Mercer022,SEEDLING,21.0,Male,74.0,286.6,260.0,242.0,33.38,0,0,0,,18.0,250.0,10.0,3.846154,55.555556
8267,10/29/2017 12:21:10,Mercer022,Lynx,22.0,Male,189.0,216.0,216.0,200.0,4.25,1,1,0,To start jogging,16.0,209.0,7.0,3.240741,43.75
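
One plausible reading of the 189 is that the height was entered in centimeters. The data does not confirm this, but the conversion lands close to the 74.4 inches in this user's 2017 entry. The sketch below does the arithmetic; `maybe_cm_to_inches` and its cutoff of 100 are assumptions made for illustration.

```python
CM_PER_INCH = 2.54

def maybe_cm_to_inches(height, cm_threshold=100.0):
    """Convert to inches if the value looks like centimeters.

    cm_threshold is a hypothetical cutoff: heights above it are
    treated as centimeters, anything else is left untouched.
    """
    if height > cm_threshold:
        return height / CM_PER_INCH
    return height

converted = maybe_cm_to_inches(189.0)  # treated as centimeters
unchanged = maybe_cm_to_inches(74.0)   # already in inches

print(round(converted, 1), unchanged)
```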


In [26]:
# Mercer022's other entries list 74 and 74.4 inches
big_df['Height'] = big_df['Height'].replace(189.0, 74.0)
# We also need to fix the BMI: for 74 inches and 216 lbs, BMI = 27.7
big_df['Starting BMI'] = big_df['Starting BMI'].replace(4.25, 27.7)
big_df[big_df.Username == 'Mercer022']

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
1430,6/4/2017 14:13:24,Mercer022,Deadpool,22.0,Unknown,74.4,260.0,229.2,220.0,29.97,1,0,0,To start jogging,9.2,224.8,4.4,1.919721,47.826087
7832,4/1/2016 13:21:19,Mercer022,SEEDLING,21.0,Male,74.0,286.6,260.0,242.0,33.38,0,0,0,,18.0,250.0,10.0,3.846154,55.555556
8267,10/29/2017 12:21:10,Mercer022,Lynx,22.0,Male,74.0,216.0,216.0,200.0,27.7,1,1,0,To start jogging,16.0,209.0,7.0,3.240741,43.75
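
Replacing by value changes every row that happens to hold that value. That is fine here, since 189.0 and 4.25 each appear once at the extremes, but a fix targeted at one row is safer in general. This is a sketch on a toy frame; the `someone_else` row is invented to show that other rows stay untouched.

```python
import pandas as pd

# Toy stand-in for big_df; 'someone_else' is a made-up row.
demo = pd.DataFrame({
    'Username': ['Mercer022', 'Mercer022', 'someone_else'],
    'Height': [74.0, 189.0, 189.0],
    'Starting BMI': [33.38, 4.25, 4.25],
})

# Select only the row we actually want to fix.
mask = (demo['Username'] == 'Mercer022') & (demo['Height'] == 189.0)

demo.loc[mask, 'Height'] = 74.0
demo.loc[mask, 'Starting BMI'] = 27.7

print(demo)
```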


The next stat we look at is the total weight loss during the challenge.

In [27]:
display(big_df.sort_values(by='Total Challenge Loss').head(15))
display(big_df.sort_values(by='Total Challenge Loss', ascending=False).head(5))

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
2669,1/7/2018 2:35:44,LeMasterOfSwag,Teacup Pig,21.0,Female,59.8,124.6,104.1,100.3,20.46,1,1,1,See my shoulder bones,3.8,219.97,-115.87,-111.306436,-3049.210526
732,3/29/2018 14:41:57,Kivotheginger,Yeti,21.0,Male,72.0,270.0,164.0,245.0,22.24,1,1,0,Workout 4 times per week,-81.0,268.0,-104.0,-63.414634,128.395062
2857,1/11/2018 4:26:54,avoidsummer,Turtle,33.0,Female,65.0,314.0,289.0,280.0,48.09,1,0,0,To visit the gym on a regular basis and do a m...,9.0,384.0,-95.0,-32.871972,-1055.555556
539,3/26/2018 11:22:00,raelinxovern,Chupacabra,23.0,Male,74.0,420.0,386.8,365.0,49.66,1,1,0,Not have to rest hands on gut while on the phone,21.8,469.0,-82.2,-21.251293,-377.06422
5549,7/15/2016 16:21:29,rtriv85,BUTTERFLY,31.0,Female,65.0,164.0,158.5,139.0,26.37,1,0,1,Start wearing my size 8 & size 10 dresses agai...,19.5,240.0,-81.5,-51.419558,-417.948718
7256,4/12/2016 7:26:31,Jennyy1,FAWN,26.0,Female,59.0,189.0,159.4,150.0,32.19,1,1,1,Loss of belly fat (post preg),9.4,185.5,-26.1,-16.373902,-277.659574
6362,8/19/2017 4:42:53,Bonsai1001,Alien,29.0,Female,67.0,180.0,139.0,129.0,21.8,1,1,1,Losing a jeans size,10.0,158.0,-19.0,-13.669065,-190.0
3278,1/6/2017 11:42:42,cattipotato,Phoenix,22.0,Female,63.0,250.0,220.2,215.0,39.0,1,0,0,Gym 4x/week,5.2,238.0,-17.8,-8.08356,-342.307692
7165,4/3/2016 10:46:43,InputZero,DUCKLING,26.0,Male,69.0,280.0,188.0,175.0,27.76,1,1,1,To fit my old clothea,13.0,205.0,-17.0,-9.042553,-130.769231
5183,7/7/2018 1:05:18,TwerkForGold,Ents,24.0,Female,72.0,360.0,319.5,300.0,43.33,1,1,0,Fit into new gym clothes better,19.5,335.0,-15.5,-4.85133,-79.487179


Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
423,3/25/2018 3:55:15,Fralance,Chupacabra,18.0,Male,71.0,250.0,248.4,230.0,34.64,1,1,0,To get back on track,18.4,111.3,137.1,55.193237,745.108696
619,3/27/2018 14:08:37,rainishamy,Unicorn,44.0,Female,69.0,340.0,309.8,292.0,45.74,1,1,0,need smaller pants,17.8,197.6,112.2,36.216914,630.337079
265,3/23/2018 19:48:41,Mandatech,Phoenix,34.0,Female,66.0,235.0,221.3,215.0,35.71,1,0,0,Zip the boots again,6.3,115.9,105.4,47.627655,1673.015873
244,3/23/2018 18:17:12,alakazam1111,Dragon,17.0,Female,59.0,167.0,141.5,135.0,28.58,0,0,0,,6.5,64.5,77.0,54.416961,1184.615385
907,4/4/2018 23:17:19,helpmeloseit247,Pegasus,26.0,Female,65.0,350.0,335.7,325.0,55.86,1,1,0,Write in my journal every night,10.7,290.0,45.7,13.613345,427.102804


So we need to ignore the 4 entries with a total loss of more than 50 lbs and the 6 entries that gained more than 20 lbs during the challenge.

In [32]:
loss_df = big_df[big_df['Total Challenge Loss'] < 50]
loss_df = loss_df[loss_df['Total Challenge Loss'] > -20]
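
To double-check how many entries the two cutoffs drop, the filter can also return a count. This is a sketch only: `filter_plausible_loss` is a hypothetical helper, and the toy series reuses a few loss values from the tables above.

```python
import pandas as pd

def filter_plausible_loss(df, low=-20, high=50):
    """Keep rows with low < Total Challenge Loss < high.

    Returns the filtered frame and the number of rows dropped.
    Rows with a missing loss are dropped too, as in the filter above.
    """
    loss = df['Total Challenge Loss']
    keep = (loss > low) & (loss < high)
    return df[keep], int((~keep).sum())

# A few values taken from the output above.
demo = pd.DataFrame({
    'Total Challenge Loss': [137.1, 45.7, -19.0, -26.1, 3.0],
})

kept, n_dropped = filter_plausible_loss(demo)
print(len(kept), n_dropped)
```

On the full data, `n_dropped` should come out at 10 if the counts above are right and no losses are missing.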

The last stat we look at is the challenge goal loss.

In [33]:
display(big_df.sort_values(by='Challenge Goal Loss').head(5))
display(big_df.sort_values(by='Challenge Goal Loss', ascending=False).head(5))

Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
4620,6/29/2018 12:18:12,Schrodinger_Dog,Rivendell,27.0,Male,70.0,318.0,254.0,345.0,36.44,1,0,0,Go to the gym 5 times a week,-91.0,243.2,10.8,4.251969,-11.868132
732,3/29/2018 14:41:57,Kivotheginger,Yeti,21.0,Male,72.0,270.0,164.0,245.0,22.24,1,1,0,Workout 4 times per week,-81.0,268.0,-104.0,-63.414634,128.395062
7044,4/2/2016 2:40:33,dubmevertigo,DAFFODIL,21.0,Female,65.7,240.0,175.0,191.0,28.5,1,1,1,To fit comfortably in size 14,-16.0,175.0,0.0,0.0,-0.0
7855,4/7/2016 10:23:30,shcamannon,SEEDLING,25.0,Female,67.0,185.0,150.0,166.0,23.49,1,1,0,Less back pain,-16.0,154.0,-4.0,-2.666667,25.0
7403,4/4/2016 16:00:03,thats_ridiculous,HAYFEVER,28.0,Female,64.0,255.0,210.0,224.0,36.04,1,0,1,"Fit comfortably into my ""skinny"" jeans",-14.0,218.0,-8.0,-3.809524,57.142857


Unnamed: 0,Timestamp,Username,Team,Age,Gender,Height,Highest Weight,Starting Weight,Challenge Goal Weight,Starting BMI,Has NSV,Has Food Tracker,Has Activity Tracker,NSV Text,Challenge Goal Loss,Final Weight,Total Challenge Loss,Challenge Percentage Lost,Percent of Challenge Goal
3076,1/8/2017 6:57:31,chelle1976,Monarch,40.0,Female,65.0,215.0,204.0,15.0,33.94,1,1,0,To be able to run longer and with greater ease,189.0,201.0,3.0,1.470588,1.587302
7080,4/2/2016 20:52:40,misskateykates,DAFFODIL,28.0,Female,66.0,199.9,198.0,20.0,31.95,1,1,1,to feel comfortable in shorts,178.0,182.2,15.8,7.979798,8.876404
7541,4/3/2016 15:00:32,Stikki_Lawndart,LADYBUG,25.0,Male,67.0,323.0,320.0,180.0,50.11,1,0,1,Stick to the habits I that help me get healthy.,140.0,302.0,18.0,5.625,12.857143
7486,4/13/2016 23:59:34,Labirynthgrl,LADYBUG,25.0,Female,67.0,281.0,279.0,150.0,43.69,1,1,0,Being able to do an invert on pole,129.0,262.6,16.4,5.878136,12.713178
5843,7/1/2016 13:48:30,Arcadia_Lynch,SUNFLOWER,28.0,Female,67.0,306.0,283.0,175.0,44.32,1,0,0,Be down one clothing size please god.,108.0,278.6,4.4,1.55477,4.074074


Looking at these values, we can see that there seem to be a lot of outliers due to input errors, such as goal weights of 15 or 20 lbs and goal weights above the starting weight. Removing them would be pretty subjective, so I won't try. Hopefully something like a bar plot will still be useful for seeing what kind of goals most people have for the challenge.
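
One way to get bar-plot-ready counts without removing anything is to bin the goals and let the extreme values fall into open-ended end bins. This is a sketch under assumptions: the bin edges and labels are arbitrary choices for illustration, and the series is a toy sample, not the real column.

```python
import pandas as pd

# Toy sample of goal losses in lbs, including two obvious input errors.
goals = pd.Series([-91.0, 3.0, 7.5, 10.0, 10.0, 13.0, 19.5, 189.0],
                  name='Challenge Goal Loss')

# Right-closed bins: (-inf, 0], (0, 5], (5, 10], ... (20, inf)
bins = [float('-inf'), 0, 5, 10, 15, 20, float('inf')]
labels = ['<=0', '0-5', '5-10', '10-15', '15-20', '>20']

binned = pd.cut(goals, bins=bins, labels=labels)

# Count per bin, kept in bin order rather than sorted by frequency.
counts = binned.value_counts().reindex(labels)
print(counts)
```

Calling `counts.plot.bar()` would then draw the bar plot, with the input errors confined to the first and last bars.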

Now that we have fixed some of the input errors, we can save the data and begin the data analysis and visualization in the [next notebook](analyze_loseit_challenge_data.ipynb).

In [34]:
big_df.to_csv('./data/outlier_fized_loseit_challenge_data.csv')
age_df.to_csv('./data/age_outlier_loseit_challenge_data.csv')
loss_df.to_csv('./data/weight_loss_outlier_loseit_challenge_data.csv')