# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [260]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
# Create a new dataframe from your CSV
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

In [261]:
# Print out any information you need to understand your dataframe
df.info()
#df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


## Missing Data

Try out different methods to locate and resolve missing data.

In [262]:
# Try to find some missing data!
df.isna().value_counts("Title")
# Chose Title to look at because it has the largest number of missing values

Title
False    19676
True      3810
Name: count, dtype: int64
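To see every column's missing values at once instead of checking one column at a time, a common pattern is to sum the boolean frame that `isna()` returns. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real reviews data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the reviews data.
toy = pd.DataFrame({
    "Title": ["Love it!", np.nan, "Beautiful", np.nan],
    "Rating": [5, 4, 3, 5],
    "Class Name": ["Dresses", "Knits", np.nan, "Pants"],
})

# isna() gives a boolean frame; summing it counts the True values per column.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```

On the real dataframe, the same call would be `df.isna().sum()`.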

In [263]:
boringHuman = {"Title": "I'm a boring human"}
dfBoringHuman = df.fillna(value=boringHuman)
dfBoringHuman

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,I'm a boring human,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,I'm a boring human,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


In [264]:
df["Rating"].value_counts(dropna=False)

Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

In [265]:
dfBoringHuman["Title"].value_counts()

# NOTE ABOUT PERSONAL LEARNING: Originally tried df.fillna(value=boringHuman).value_counts() and the filled values never showed up in the result.
# fillna returns a new dataframe rather than changing df, so I saved it to its own variable. Calling value_counts() on a whole dataframe also counts unique rows rather than titles, so the column has to be selected before counting.




Title
I'm a boring human                     3810
Love it!                                136
Beautiful                                95
Love                                     88
Love!                                    84
                                       ... 
Full skirt                                1
Not exactly what i expected               1
Flattering and lovely sweater dress       1
Perfect except slip                       1
Cagrcoal shimmer fun                      1
Name: count, Length: 13994, dtype: int64
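The difference between counting a column and counting a whole dataframe can be sketched on a tiny made-up frame (`toy` and its values are hypothetical). `Series.value_counts()` counts the values in one column, while `DataFrame.value_counts()` counts unique rows and, by default, skips any row that still contains a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the last row is still missing its review text.
toy = pd.DataFrame({
    "Title": ["Love it!", np.nan, np.nan],
    "Review Text": ["Great", "Fine", np.nan],
})

placeholder = {"Title": "I'm a boring human"}

# Selecting the column first counts titles, and it chains fine on one line.
title_counts = toy.fillna(value=placeholder)["Title"].value_counts()

# Without a column selection, unique rows are counted instead,
# and the row that still has a NaN in "Review Text" is left out.
row_counts = toy.fillna(value=placeholder).value_counts()

print(title_counts)
print(row_counts)
```

Either way, `toy` itself is unchanged, because `fillna` returns a new object unless its result is assigned back.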

Did you find any missing data? What things worked well for you and what did not?

In [266]:
# Respond to the above questions here:

# Yes: I found missing data in Title, which is what I expected to find. I filled those in with a placeholder because people who cannot be bothered to put a title on their review are boring.
# Other columns have missing values too (Review Text, Division Name, Department Name, Class Name). They are all object columns, and leaving them as null feels fine for now. The int columns have no missing values, so there was nothing to replace there.

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [267]:
# Keep an eye out for outliers!
# Running info again as a refresher of what the columns are
dfBoringHuman.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    23486 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


In [268]:
# Running describe to check the values specifically
dfBoringHuman.describe()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,23486.0,23486.0,23486.0,23486.0,23486.0,23486.0
mean,11742.5,918.118709,43.198544,4.196032,0.822362,2.535936
std,6779.968547,203.29898,12.279544,1.110031,0.382216,5.702202
min,0.0,0.0,18.0,1.0,0.0,0.0
25%,5871.25,861.0,34.0,4.0,1.0,0.0
50%,11742.5,936.0,41.0,5.0,1.0,1.0
75%,17613.75,1078.0,52.0,5.0,1.0,3.0
max,23485.0,1205.0,99.0,5.0,1.0,122.0


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [269]:
# Make your notes here:
# First is understanding the context of everything and establishing, based on general logic, what is and isn't possible within each range
# Unnamed: 0: Has the same value as the index column and counts sequentially
# Clothing ID: Cannot make any determination, since in theory any clothing piece could have any ID number
# Age: A minimum age of 18 and a maximum of 99 feels completely plausible
# Rating: All ratings are between 1 and 5, so all feels fine. It is interesting that 50%+ of the ratings are 5-star
# Recommended IND: Not exactly sure what IND is in this case, but every value is either 0 or 1
# Positive Feedback Count: Now this one actually seems to have outliers. When the 75th percentile is 3 but the max is 122, that's crazy. I still do not have a great feel for how this one works mechanically, but those numbers are so far beyond the 75th percentile
    # and so many standard deviations away that they seem impossible to achieve.

In [270]:
# Ran quantile to find the 99th percentile of Positive Feedback Count. Even though it is still several std above the 75th percentile, it *feels* more natural. However, I remain unsure whether that's just inherent to the nature of this column. The lack of documentation on what it means
    # and how it's counted really does make it difficult to know whether these are true outliers.
dfBoringHuman["Positive Feedback Count"].quantile(.99)

np.float64(26.0)
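A percentile cutoff is one way to flag outliers. Another common rule of thumb, sketched here on made-up numbers, is the interquartile range (IQR) fence: anything above Q3 + 1.5 × IQR is flagged. The values in `counts` are hypothetical, and the 1.5 multiplier is a convention rather than something this dataset dictates:

```python
import pandas as pd

# Hypothetical feedback counts with one very large value.
counts = pd.Series([0, 0, 0, 1, 1, 2, 3, 3, 5, 122], name="Positive Feedback Count")

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Conventional upper fence: Q3 plus 1.5 times the interquartile range.
upper_fence = q3 + 1.5 * iqr
outliers = counts[counts > upper_fence]

print(upper_fence)
print(outliers)
```

For a heavily skewed column like this one, the IQR fence tends to flag far more rows than a 99th-percentile cutoff, so whichever rule is used still needs a judgment call about what the column means.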

In [271]:
dfBoringHuman["Positive Feedback Count"].value_counts().sort_values().head(25)
# These values each show up only once, so the likelihood of them being mistakes feels higher. It still feels off, though, because every one of them is a different value.

Positive Feedback Count
66     1
52     1
117    1
94     1
77     1
61     1
71     1
84     1
68     1
89     1
87     1
122    1
108    1
54     1
59     1
69     1
50     1
99     1
48     1
93     1
95     1
98     1
64     1
56     1
78     1
Name: count, dtype: int64

In [272]:
# Replacing all Positive Feedback Count values above the 99th percentile with NaN, as they are so far removed from everything else that it seems reasonable. Again, though, my lack of familiarity with how the count is actually calculated means this could be wrong.
    # It could be similar to how, in most f2p games, 10-20% of the playerbase has 0 playtime because they install the game but never actually open it.
dfBoringHuman.loc[dfBoringHuman["Positive Feedback Count"] > dfBoringHuman["Positive Feedback Count"].quantile(0.99), "Positive Feedback Count"] = np.nan
dfBoringHuman.describe()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,23486.0,23486.0,23486.0,23486.0,23486.0,23255.0
mean,11742.5,918.118709,43.198544,4.196032,0.822362,2.152268
std,6779.968547,203.29898,12.279544,1.110031,0.382216,3.825918
min,0.0,0.0,18.0,1.0,0.0,0.0
25%,5871.25,861.0,34.0,4.0,1.0,0.0
50%,11742.5,936.0,41.0,5.0,1.0,1.0
75%,17613.75,1078.0,52.0,5.0,1.0,3.0
max,23485.0,1205.0,99.0,5.0,1.0,26.0


In [273]:
# Verified that everything above the 99th percentile was chopped off. I am still unsure whether this is the right call, but the code worked as intended
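If overwriting the extreme counts with NaN feels too aggressive, two gentler options are to cap them at the cutoff or to keep them and add a flag column. This is only a sketch on hypothetical values; `cap = 26` is taken from the 99th percentile reported earlier in this notebook:

```python
import pandas as pd

# Hypothetical feedback counts, including one extreme value.
feedback = pd.Series([0, 1, 2, 3, 26, 122], name="Positive Feedback Count")

cap = 26  # 99th percentile from the quantile call earlier in the notebook

# Option 1: cap values at the cutoff instead of discarding them.
capped = feedback.clip(upper=cap)

# Option 2: leave the data alone and record which rows are extreme.
is_extreme = feedback > cap

print(capped.tolist())
print(int(is_extreme.sum()))
```

Both options keep the column as integers, since no NaN is introduced.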

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [274]:
# Look out for unnecessary data!
dfBoringHuman[dfBoringHuman.duplicated()]

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name


In [275]:
# After the above cell returned no rows, I ran the inverse to verify the code was working as intended
dfBoringHuman[~dfBoringHuman.duplicated()]
# This means that there are no duplicate rows in this dataset. However, looking at this, Unnamed: 0 could technically be removed, since it repeats the index without actually being set as the index column

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,I'm a boring human,Absolutely wonderful - silky and sexy and comf...,4,1,0.0,Initmates,Intimate,Intimates
1,1,1080,34,I'm a boring human,Love this dress! it's sooo pretty. i happene...,5,1,4.0,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0.0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0.0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6.0,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0.0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0.0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1.0,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2.0,General,Dresses,Dresses


In [276]:
dfBoringHuman
# Ran this as a base command to double check that Unnamed: 0 and the index column were indeed duplicates

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,I'm a boring human,Absolutely wonderful - silky and sexy and comf...,4,1,0.0,Initmates,Intimate,Intimates
1,1,1080,34,I'm a boring human,Love this dress! it's sooo pretty. i happene...,5,1,4.0,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0.0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0.0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6.0,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0.0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0.0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1.0,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2.0,General,Dresses,Dresses


In [277]:
#Dropped Unnamed: 0 because it's unnecessary since an index column already exists

dfBoringDroppedUnnamed = dfBoringHuman.drop("Unnamed: 0", axis=1)
dfBoringDroppedUnnamed

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,I'm a boring human,Absolutely wonderful - silky and sexy and comf...,4,1,0.0,Initmates,Intimate,Intimates
1,1080,34,I'm a boring human,Love this dress! it's sooo pretty. i happene...,5,1,4.0,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0.0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0.0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6.0,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...
23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0.0,General Petite,Dresses,Dresses
23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0.0,General Petite,Tops,Knits
23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1.0,General Petite,Dresses,Dresses
23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2.0,General,Dresses,Dresses
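An `Unnamed: 0` column usually appears when a CSV was written with its index and then read back without telling pandas about it. An alternative to dropping the column afterwards is to pass `index_col=0` to `read_csv`. The sketch below uses a tiny in-memory CSV (`csv_text` is made up) rather than the real file:

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column has an empty header, like a saved index.
csv_text = ",Clothing ID,Age\n0,767,33\n1,1080,34\n"

# Default read: the unlabeled first column shows up as "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(as_index.columns))
```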


In [278]:
# Realized that the Unnamed: 0 column meant there could, by definition, be no duplicate rows: its values match the index, so every row was unique even when everything else was identical
dfBoringDroppedUnnamed.loc[dfBoringDroppedUnnamed.duplicated()]

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
493,1104,39,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
2959,1094,30,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
4850,829,66,I'm a boring human,,5,1,0.0,General Petite,Tops,Blouses
5671,861,34,I'm a boring human,,5,1,0.0,General,Tops,Knits
5776,868,61,I'm a boring human,,5,1,0.0,General,Tops,Knits
9306,834,70,I'm a boring human,,5,1,0.0,General Petite,Tops,Blouses
9413,1094,39,I'm a boring human,,5,1,0.0,General Petite,Dresses,Dresses
9430,1094,36,I'm a boring human,,5,1,0.0,General Petite,Dresses,Dresses
10787,1078,35,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
14129,862,38,I'm a boring human,,5,1,0.0,General Petite,Tops,Knits


Did you find any unnecessary data in your dataset? How did you handle it?

In [279]:
# Whoa, there actually are duplicates now. Running the command with keep=False to verify they are, in fact, true duplicates
dfBoringDroppedUnnamed.loc[dfBoringDroppedUnnamed.duplicated(keep=False)]

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
298,1104,39,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
493,1104,39,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
1004,1094,30,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
2739,1094,36,I'm a boring human,,5,1,0.0,General Petite,Dresses,Dresses
2941,829,66,I'm a boring human,,5,1,0.0,General Petite,Tops,Blouses
2959,1094,30,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
3961,895,36,I'm a boring human,,5,1,0.0,General Petite,Tops,Fine gauge
3976,1078,35,I'm a boring human,,5,1,0.0,General,Dresses,Dresses
4850,829,66,I'm a boring human,,5,1,0.0,General Petite,Tops,Blouses
4866,834,70,I'm a boring human,,5,1,0.0,General Petite,Tops,Blouses


In [280]:
# Yep, they are. Dropping duplicates and running loc again to verify nothing is left
dfBoringDropUnnamedAndDupes = dfBoringDroppedUnnamed.drop_duplicates()
dfBoringDropUnnamedAndDupes.loc[dfBoringDropUnnamedAndDupes.duplicated()]

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
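Duplicates hidden behind a unique ID column can also be found without dropping that column first, by telling `duplicated` and `drop_duplicates` which columns to compare through `subset`. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 match on everything except the ID column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [1104, 1104, 862],
    "Age": [39, 39, 38],
    "Rating": [5, 5, 5],
})

# Comparing all columns finds nothing, because the ID makes every row unique.
dupes_all_columns = int(toy.duplicated().sum())

# Comparing only the content columns reveals the repeated row.
content_cols = [col for col in toy.columns if col != "Unnamed: 0"]
dupes_content_only = int(toy.duplicated(subset=content_cols).sum())

deduped = toy.drop_duplicates(subset=content_cols)
print(dupes_all_columns, dupes_content_only, len(deduped))
```

Whether two identical-looking reviews are truly the same review or two separate customers is still a judgment call that the data alone may not settle.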


## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.
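One quick way to test a text column for formatting-only inconsistencies is to normalize whitespace and case, then compare the number of unique values before and after. The values below are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical category labels with stray spaces and mixed case.
classes = pd.Series(["Dresses", "dresses ", "Knits", " Knits", "Pants"])

# Strip surrounding whitespace and lowercase everything.
normalized = classes.str.strip().str.lower()

raw_unique = classes.nunique()
normalized_unique = normalized.nunique()

# A drop in the unique count means some labels differed only in formatting.
print(raw_unique, normalized_unique)
```

If the two counts match, the column probably has no whitespace or capitalization variants, although misspellings would still slip through this check.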

In [281]:
# Look out for inconsistent data!
# Searching through the columns one at a time to see if any values look out of the ordinary
dfBoringDropUnnamedAndDupes["Class Name"].value_counts()
#Nothing in here looks out of place

Class Name
Dresses           6312
Knits             4835
Blouses           3093
Sweaters          1428
Pants             1388
Jeans             1146
Fine gauge        1099
Skirts             945
Jackets            704
Lounge             691
Swim               350
Outerwear          328
Shorts             317
Sleep              228
Legwear            165
Intimates          154
Layering           146
Trend              119
Casual bottoms       2
Chemises             1
Name: count, dtype: int64

In [282]:
dfBoringDropUnnamedAndDupes["Department Name"].value_counts()
#These also look fine

Department Name
Tops        10455
Dresses      6312
Bottoms      3798
Intimate     1735
Jackets      1032
Trend         119
Name: count, dtype: int64

In [283]:
dfBoringDropUnnamedAndDupes["Division Name"].value_counts()
# These mostly look good, although "Initmates" may be a misspelling of "Intimates"

Division Name
General           13839
General Petite     8110
Initmates          1502
Name: count, dtype: int64
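One value in the output above may deserve a second look: `Initmates` appears to be a misspelling of "Intimates" (the Department Name column uses `Intimate`). Assuming it really is a typo, which the dataset itself does not confirm, a dictionary passed to `replace` would fix it. The series below is a hypothetical stand-in for the real column:

```python
import pandas as pd

# Hypothetical stand-in for the Division Name column.
division = pd.Series(
    ["General", "General Petite", "Initmates", "Initmates"],
    name="Division Name",
)

# Map the suspected misspelling to the assumed correct spelling.
fixed = division.replace({"Initmates": "Intimates"})

print(fixed.value_counts())
```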

In [284]:
dfBoringDropUnnamedAndDupes["Positive Feedback Count"].value_counts(dropna=False)
# After the earlier .loc assignment, every value in this column is a float instead of an int. That has to happen because a NumPy int column has no way to store NaN

Positive Feedback Count
0.0     11155
1.0      4043
2.0      2193
3.0      1433
4.0       922
5.0       673
6.0       525
7.0       374
8.0       319
9.0       261
NaN       231
10.0      225
11.0      178
12.0      146
14.0      121
13.0      102
15.0       94
17.0       81
16.0       74
18.0       62
19.0       54
20.0       40
23.0       31
21.0       30
22.0       29
25.0       25
26.0       23
24.0       21
Name: count, dtype: int64
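If whole-number counts are preferred even with missing values present, pandas offers a nullable integer dtype, written `"Int64"` with a capital I. A minimal sketch on a hypothetical float column:

```python
import numpy as np
import pandas as pd

# Hypothetical float column after outliers were replaced with NaN.
feedback = pd.Series([0.0, 4.0, np.nan], name="Positive Feedback Count")

# "Int64" is the nullable integer dtype; NaN becomes <NA>.
feedback_int = feedback.astype("Int64")

print(feedback_int.dtype)
print(feedback_int.isna().sum())
```

This conversion only works when every non-missing float is a whole number, which should hold for a count column.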

In [285]:
dfBoringDropUnnamedAndDupes["Recommended IND"].value_counts()
#is legit just int versions of booleans, checks out

Recommended IND
1    19293
0     4172
Name: count, dtype: int64

In [286]:
dfBoringDropUnnamedAndDupes["Rating"].value_counts()
#All ratings check out

Rating
5    13111
4     5076
3     2871
2     1565
1      842
Name: count, dtype: int64

In [None]:
dfBoringDropUnnamedAndDupes["Review Text"].value_counts()
# Interesting that a couple of these are duplicated even after the actual duplicates were removed. Boring humans, indeed.

Review Text
Perfect fit and i've gotten so many compliments. i buy all my suits from here now!                                                                                                                                                                                                                                                                                                                                                                                                                                        3
I purchased this and another eva franco dress during retailer's recent 20% off sale. i was looking for dresses that were work appropriate, but that would also transition well to happy hour or date night. they both seemed to be just what i was looking for. i ordered a 4 regular and a 6 regular, as i am usually in between sizes. the 4 was definitely too small. the 6 fit, technically, but was very ill fitting. not only is the dress itself short, but it is very short-waisted. i a

In [None]:
dfBoringDropUnnamedAndDupes["Title"].value_counts()
#This also checks out

Title
I'm a boring human                     3789
Love it!                                136
Beautiful                                95
Love                                     88
Love!                                    84
                                       ... 
Full skirt                                1
Not exactly what i expected               1
Flattering and lovely sweater dress       1
Perfect except slip                       1
Cagrcoal shimmer fun                      1
Name: count, Length: 13994, dtype: int64

In [None]:
dfBoringDropUnnamedAndDupes["Age"].value_counts(dropna=False)
#As far as I can tell this is also fine, all ages are ints and as established before are between 18 and 99

Age
39    1267
35     905
36     839
34     802
38     779
      ... 
93       2
99       2
86       2
90       2
92       1
Name: count, Length: 77, dtype: int64

In [None]:
dfBoringDropUnnamedAndDupes["Clothing ID"].value_counts(dropna=False)
#Interesting that so many Clothing IDs are reused but makes sense if a lot of people are buying popular items

Clothing ID
1078    1021
862      802
1094     753
1081     582
872      544
        ... 
181        1
430        1
681        1
217        1
637        1
Name: count, Length: 1206, dtype: int64

In [305]:
# That all said, we might as well run data type conversions just to make sure.
# Passing a dict to astype and reassigning the result converts every column in one step
# and avoids the SettingWithCopyWarning that assigning column by column raised here.
# Careful: astype(str) turns NaN into the literal string "nan", which is why the object columns below show no nulls.
dfBoringDropUnnamedAndDupes = dfBoringDropUnnamedAndDupes.astype({
    "Clothing ID": int,
    "Age": int,
    "Title": str,
    "Review Text": str,
    "Rating": int,
    "Recommended IND": int,
    "Positive Feedback Count": float,
    "Division Name": str,
    "Department Name": str,
    "Class Name": str,
})
dfBoringDropUnnamedAndDupes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23465 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Clothing ID              23465 non-null  int64  
 1   Age                      23465 non-null  int64  
 2   Title                    23465 non-null  object 
 3   Review Text              23465 non-null  object 
 4   Rating                   23465 non-null  int64  
 5   Recommended IND          23465 non-null  int64  
 6   Positive Feedback Count  23234 non-null  float64
 7   Division Name            23465 non-null  object 
 8   Department Name          23465 non-null  object 
 9   Class Name               23465 non-null  object 
dtypes: float64(1), int64(4), object(5)
memory usage: 2.0+ MB
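One side effect of the conversion above is visible in the `info()` output: Review Text now reports no nulls, because `astype(str)` turned the missing values into text. If the goal is a text dtype that still keeps track of what is missing, the pandas `"string"` dtype is one option. A sketch on a hypothetical series (the `astype(str)` behavior described in the comment is what this notebook's pandas version shows and may differ in other versions):

```python
import numpy as np
import pandas as pd

# Hypothetical review text with one missing entry.
reviews = pd.Series(["Love it", np.nan, "Too small"], name="Review Text")

# In this notebook's output, astype(str) made missing values count as non-null text.
as_str = reviews.astype(str)

# The "string" dtype keeps the missing entry as <NA>.
as_string = reviews.astype("string")

print(as_str.tolist())
print(as_string.isna().sum())
```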

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
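# As far as I can tell it mostly checks out. The one thing I would flag is "Initmates" in Division Name, which looks like a misspelling of "Intimates". Maybe there were other inconsistencies before, but it all feels good now. Would love to learn what I missed if I'm wrong though.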