# Cleaning Practice
Let's first practice handling missing values and duplicate data using the `cancer_data_means.csv` file, which you created in a previous section.

In [None]:
# import pandas and load cancer data
import pandas as pd

df = pd.read_csv('cancer_data_means.csv')

# check which columns have missing values with info()
df.info()

# collect the names of the columns that contain missing values
missing_val = df.columns[df.isnull().any()]

In [None]:
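# use means to fill in missing values
for col in missing_val:
    col_mean = df[col].mean()
    # assign the result back; calling fillna(inplace=True) on df[col]
    # may act on a copy and is deprecated in recent pandas versions
    df[col] = df[col].fillna(col_mean)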

# confirm your correction with info()
df.info()
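
As a side note, the loop above can also be written as a single `fillna` call. The sketch below uses a small made-up dataframe (the `toy` name and its values are hypothetical, not the real cancer data) so it can run on its own; treat it as one possible approach rather than the required one.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for cancer_data_means.csv
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'radius_mean': [10.0, np.nan, 14.0, 12.0],
    'texture_mean': [20.0, 22.0, np.nan, 24.0],
})

# count missing values per column before filling
missing_before = toy.isnull().sum()

# fill each numeric column's missing values with that column's mean
filled = toy.fillna(toy.mean(numeric_only=True))

# total number of missing values left after filling
missing_after = int(filled.isnull().sum().sum())
print(missing_before)
print(missing_after)
```

Passing a Series of column means to `fillna` matches each mean to its column by label, so no explicit loop is needed.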

In [11]:
# check for duplicates in the data
df.duplicated().sum()

5

In [16]:
# drop duplicates
df.drop_duplicates(inplace=True)

In [None]:
# confirm correction by rechecking for duplicates in the data
df.duplicated().sum()

0
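
To see what `duplicated()` and `drop_duplicates()` are doing, here is a minimal sketch on a made-up dataframe (the `toy` data is hypothetical). By default, a row counts as a duplicate only if every column matches an earlier row, and the first occurrence is kept.

```python
import pandas as pd

# hypothetical toy data: the third row repeats the second exactly
toy = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'radius_mean': [10.0, 12.0, 12.0, 14.0],
})

# number of rows that are exact copies of an earlier row
n_dupes = int(toy.duplicated().sum())

# drop the copies (keeping first occurrences) and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(len(deduped))
```

Returning a new dataframe instead of using `inplace=True` is a style choice here; both approaches remove the same rows.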

## Renaming Columns
Since we previously trimmed the dataset to include only the means of the tumor features, the "_mean" suffix on each feature name is unnecessary and just takes extra time to type in later analysis. Let's build a list of new labels to assign to our columns.

In [None]:
# remove "_mean" from column names
new_labels = []
for col in df.columns:
    if col.endswith('_mean'):
        new_labels.append(col[:-5])  # exclude last 5 characters ("_mean")
    else:
        new_labels.append(col)

# new labels for our columns
new_labels

In [None]:
# assign new labels to columns in dataframe
df.columns = new_labels

# display first few rows of dataframe to confirm changes
df.head()
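
If you prefer to skip the explicit loop, the same renaming can be sketched with the string methods on the column index. The column names below are hypothetical stand-ins for the real ones, and the pattern is anchored with `$` so that only a trailing "_mean" is removed.

```python
import pandas as pd

# hypothetical column names standing in for the real dataset's columns
toy = pd.DataFrame(columns=['id', 'diagnosis', 'radius_mean', 'concave_points_mean'])

# strip "_mean" only when it appears at the very end of a column name
toy.columns = toy.columns.str.replace(r'_mean$', '', regex=True)

renamed = list(toy.columns)
print(renamed)
```

Columns without the suffix, such as `id` and `diagnosis` here, pass through unchanged.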

In [None]:
# save this for later
df.to_csv('cancer_data_edited.csv', index=False)