# Data Cleaning Practice  

Goals
- Practice data cleaning with Python
- Seek concise code
- Clearly document decisions
- Be thorough 

Dataset:
- Alzheimer's Disease and Healthy Aging Data
- Source: https://catalog.data.gov/dataset/alzheimers-disease-and-healthy-aging-data

In [2]:
### Packages
import numpy as np
# import pandas as pd
import modin.pandas as pd # faster
import os
from janitor import clean_names
os.chdir('C:/Users/WulfN/')

### Read in Data
alz_messy = clean_names(pd.read_csv('./datasets/unclean_data_practice/Alzheimer_s_Disease_and_Healthy_Aging_Data.csv'))
                     
# remove scientific notation
pd.options.display.float_format = '{:20,.2f}'.format

# multiple outputs per cell
%config InteractiveShell.ast_node_interactivity = "all"

alz_messy

Unnamed: 0,rowid,yearstart,yearend,locationabbr,locationdesc,datasource,class,topic,question,data_value_unit,...,stratification2,geolocation,classid,topicid,questionid,locationid,stratificationcategoryid1,stratificationid1,stratificationcategoryid2,stratificationid2
0,BRFSS~2022~2022~42~Q03~TMC01~AGE~RACE,2022,2022,PA,Pennsylvania,BRFSS,Mental Health,Frequent mental distress,Percentage of older adults who are experiencin...,%,...,Native Am/Alaskan Native,POINT (-77.86070029 40.79373015),C05,TMC01,Q03,42,AGE,5064,RACE,NAA
1,BRFSS~2022~2022~46~Q03~TMC01~AGE~RACE,2022,2022,SD,South Dakota,BRFSS,Mental Health,Frequent mental distress,Percentage of older adults who are experiencin...,%,...,Asian/Pacific Islander,POINT (-100.3735306 44.35313005),C05,TMC01,Q03,46,AGE,65PLUS,RACE,ASN
2,BRFSS~2022~2022~16~Q03~TMC01~AGE~RACE,2022,2022,ID,Idaho,BRFSS,Mental Health,Frequent mental distress,Percentage of older adults who are experiencin...,%,...,"Black, non-Hispanic",POINT (-114.36373 43.68263001),C05,TMC01,Q03,16,AGE,65PLUS,RACE,BLK
3,BRFSS~2022~2022~24~Q03~TMC01~AGE~RACE,2022,2022,MD,Maryland,BRFSS,Mental Health,Frequent mental distress,Percentage of older adults who are experiencin...,%,...,"Black, non-Hispanic",POINT (-76.60926011 39.29058096),C05,TMC01,Q03,24,AGE,65PLUS,RACE,BLK
4,BRFSS~2022~2022~55~Q03~TMC01~AGE~GENDER,2022,2022,WI,Wisconsin,BRFSS,Mental Health,Frequent mental distress,Percentage of older adults who are experiencin...,%,...,Male,POINT (-89.81637074 44.39319117),C05,TMC01,Q03,55,AGE,65PLUS,GENDER,MALE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284137,BRFSS~2016~2016~55~Q15~TSC02~AGE~RACE,2016,2016,WI,Wisconsin,BRFSS,Screenings and Vaccines,Colorectal cancer screening,Percentage of older adults who had either a ho...,%,...,"Black, non-Hispanic",POINT (-89.81637074 44.39319117),C03,TSC02,Q15,55,AGE,AGE_OVERALL,RACE,BLK
284138,BRFSS~2017~2017~56~Q45~TOC13~AGE~RACE,2017,2017,WY,Wyoming,BRFSS,Overall Health,Fair or poor health among older adults with ar...,Fair or poor health among older adults with do...,%,...,Hispanic,POINT (-108.1098304 43.23554134),C01,TOC13,Q45,56,AGE,5064,RACE,HIS
284139,BRFSS~2015~2015~56~Q42~TCC04~AGE~RACE,2015,2015,WY,Wyoming,BRFSS,Cognitive Decline,Talked with health care professional about sub...,Percentage of older adults with subjective cog...,%,...,Asian/Pacific Islander,POINT (-108.1098304 43.23554134),C06,TCC04,Q42,56,AGE,AGE_OVERALL,RACE,ASN
284140,BRFSS~2019~2019~54~Q46~TOC10~AGE~RACE,2019,2019,WV,West Virginia,BRFSS,Overall Health,"Disability status, including sensory or mobili...",Percentage of older adults who report having a...,%,...,Hispanic,POINT (-80.71264013 38.6655102),C01,TOC10,Q46,54,AGE,65PLUS,RACE,HIS


In [3]:
alz_messy.info()

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 284142 entries, 0 to 284141
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   rowid                       284142 non-null  object 
 1   yearstart                   284142 non-null  int64  
 2   yearend                     284142 non-null  int64  
 3   locationabbr                284142 non-null  object 
 4   locationdesc                284142 non-null  object 
 5   datasource                  284142 non-null  object 
 6   class                       284142 non-null  object 
 7   topic                       284142 non-null  object 
 8   question                    284142 non-null  object 
 9   data_value_unit             284142 non-null  object 
 10  datavaluetypeid             284142 non-null  object 
 11  data_value_type             284142 non-null  object 
 12  data_value                  192808 non-null  float64
 13  data_valu

### Currently unable to view data in data viewer. Will be using other methods to view the data

After viewing the data (in RStudio, for now), here are some proposed changes.
- Does a data dictionary exist for this data?
    - https://chronicdata.cdc.gov/Healthy-Aging/Alzheimer-s-Disease-and-Healthy-Aging-Data/hfr9-rurv
- View categories of the stratified columns
    - Create bar graphs with counts
- State or region indicator for the locationdesc column
- Change data value unit '%' to 'percentage'
- What is the data value alt?
- Distinct data footnotes: can many of these be removed?
- How are the low and high confidence variables derived?
- Clean geolocation column, separate latitude and longitude.
- Class id, question id, location id: what are these?
- Range of year start and year end?
- What to do with NAs for the data values?

- Exercise with class / method
- Creating methods to retrieve categories in a dataframe
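
For the geolocation item in the list above, here is a minimal sketch of how the latitude and longitude could be separated. It assumes the values follow the `POINT (longitude latitude)` order, which matches the sample rows (Pennsylvania shows roughly -77.9, 40.8). The helper name `parse_point` and the pattern are illustrative, not part of the dataset or any library.

```python
import re

# Matches points such as "POINT (-77.86070029 40.79373015)".
# Assumption: the first number is longitude, the second is latitude.
POINT_PATTERN = re.compile(
    r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)"
)

def parse_point(value):
    """Return (latitude, longitude) from a 'POINT (lon lat)' string.

    Returns (None, None) for missing or unparseable values.
    """
    if not isinstance(value, str):
        return (None, None)  # e.g. NaN for rows without a location
    match = POINT_PATTERN.search(value)
    if match is None:
        return (None, None)
    longitude, latitude = (float(part) for part in match.groups())
    return (latitude, longitude)

print(parse_point("POINT (-77.86070029 40.79373015)"))
```

Applied to the data, something like `alz_messy['geolocation'].map(parse_point)` should give tuples that can then be split into two columns; this has not been checked against modin.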

In [4]:
# This needs to be changed from lists to NumPy - perhaps move to different script

class unique_categories: 

    def __init__(self, dataframe, category_list): 
        self.dataframe = dataframe
        self.category_list = category_list

    def category_dict(self): 
        """
        Create dictionary from dataframe with feature names and distinct 
        categories for that feature name.
        """
        category_df = self.dataframe[self.category_list]
        # One list of distinct values per selected column
        return {c: category_df[c].unique().tolist() for c in category_df}

    def get_num_categories(self): 
        """
        Number of categories per feature given.
        """
        return [len(values) for values in self.category_dict().values()]
    
    def category_lists(self): 
        """
        Lists with categories and NAs such that the length of all 
        lists are the same. 
        """
        num_categories = self.get_num_categories()
        category_dict = self.category_dict()

        add_na = []

        for item in num_categories:
            sum_na = [max(num_categories) - item]
            add_na = add_na + sum_na

        for i, key in enumerate(category_dict): 
            category_dict[key].extend(['NA'] * add_na[i]) 

        return category_dict

    def category_count_df(self):
        """
        Dataframe with feature names, number of categories for that feature, 
        and a list of the categories for that feature. 
        """
        category_dict = self.category_dict()

        category_count_df = pd.DataFrame({
            'var_name': category_dict.keys(),
            'num_categories': self.get_num_categories(),
            'categories': list(category_dict.values())
        })

        return category_count_df

    def category_df(self):
        """
        Feature name as a column, categories as values for that feature.
        """
        category_lists = self.category_lists()

        category_df = pd.DataFrame(
            data = list(zip(*category_lists.values())), 
            columns = list(category_lists.keys())
        ) 

        return category_df

In [16]:
# --- Columns can likely be de-duplicated with ID columns
misc = ['class', 'classid', 'topic', 'topicid', 'question', 'questionid', 'data_value_footnote', 'locationid'] 
stratified = [col for col in alz_messy.columns if col.startswith('strati')]

# Consider selecting columns that end with ID, plus a few more. 
category_features = misc + stratified # something is wrong with this list here

In [19]:
# Is any of this valuable - why not create tables or graphs with counts...
category_info = unique_categories(alz_messy, category_features)

In [None]:
category_info.category_df() # takes eternity due to nested loops
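
Following the question above about tables with counts, here is a sketch of a long-format count table built with `value_counts`. It avoids padding lists to equal length. The function name `category_counts` and the output column names are assumptions for illustration, and the example uses plain pandas rather than modin.

```python
import pandas as pd

def category_counts(df, columns):
    """Long-format table: one row per (feature, category) with its row count."""
    frames = []
    for col in columns:
        counts = (
            df[col]
            .value_counts(dropna=False)   # keep missing values as their own category
            .rename_axis('category')
            .reset_index(name='count')
            .assign(var_name=col)
        )
        frames.append(counts[['var_name', 'category', 'count']])
    return pd.concat(frames, ignore_index=True)

# Toy data standing in for alz_messy
toy = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['p', None, None]})
print(category_counts(toy, ['a', 'b']))
```

The number of categories per feature can be read off with `category_counts(...).groupby('var_name').size()`.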

In [28]:
# Does year start always equal year end? 
# need to ensure that when grouping by year we are grouping by the same time periods
diff_test = alz_messy[['yearstart', 'yearend']].copy()  # copy so the new column is not set on a slice
diff_test['yeardiff'] = diff_test['yearend'] - diff_test['yearstart']

multi_year = diff_test[diff_test['yeardiff'] > 0]
multi_year # 9261 rows have multiple years. This will need to be thought of carefully. 

Unnamed: 0,yearstart,yearend,yeardiff
31703,2019,2022,3
31707,2021,2022,1
31708,2021,2022,1
31709,2021,2022,1
31710,2021,2022,1
...,...,...,...
52022,2019,2022,3
52025,2019,2022,3
52032,2019,2022,3
52033,2019,2022,3


In [None]:
# Feature level line graphs giving count of category over time

# Group by year and category by variable

# For each col, group by year and that column

# Can the result be a grid of each of the plots in the list of columns? (Like facetGrid)

# Will there be too many plots for this to be clear? (experiment)
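
A possible way to get the grid described above, as a sketch: reshape the selected columns to long format, count rows per year and category, then let seaborn's `relplot` draw one panel per feature. The function names and the `feature` / `category` / `count` column names are assumptions, and the plotting part is untested on the full data.

```python
import pandas as pd

def long_category_counts(df, year_col, columns):
    """One row per (year, feature, category) with the number of matching rows."""
    long_df = df.melt(id_vars=year_col, value_vars=columns,
                      var_name='feature', value_name='category')
    long_df = long_df.dropna(subset=['category'])
    # Cast to str so features with different dtypes can share one column
    long_df['category'] = long_df['category'].astype(str)
    return (
        long_df.groupby([year_col, 'feature', 'category'])
               .size()
               .reset_index(name='count')
    )

def plot_category_grid(counts, year_col, col_wrap=3):
    """Draw one line-plot panel per feature (seaborn imported only when plotting)."""
    import seaborn as sns
    return sns.relplot(
        data=counts, x=year_col, y='count', hue='category',
        col='feature', col_wrap=col_wrap, kind='line', marker='o',
        facet_kws={'sharey': False},
    )

# Toy data standing in for alz_messy
toy_grid = pd.DataFrame({
    'yearstart': [2020, 2020, 2021],
    'a': ['x', 'x', 'y'],
    'b': ['p', 'q', 'q'],
})
grid_counts = long_category_counts(toy_grid, 'yearstart', ['a', 'b'])
print(grid_counts)
```

With many features or many categories per feature, the shared legend may get crowded, so `col_wrap` and the list of columns would likely need some experimenting.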

In [None]:
### Template AI Code to refine:
import matplotlib.pyplot as plt  # needed for the plt.* calls below
import seaborn as sns

def plot_category_trends_seaborn(df, year_col, columns_to_plot): 
    sns.set_theme(style="whitegrid", palette="muted")  # set_theme is the current name; sns.set is its alias
    
    for col in columns_to_plot:
        # Group by year and the categorical column, then count occurrences
        category_counts = df.groupby([year_col, col]).size().reset_index(name='Count')

        # Plot using seaborn's lineplot
        plt.figure(figsize=(10, 6))
        sns.lineplot(data=category_counts, x=year_col, y='Count', hue=col, marker="o")

        # Set titles and labels
        plt.title(f'Trend for {col}', fontsize=16)
        plt.xlabel('Year', fontsize=12)
        plt.ylabel('Count', fontsize=12)
        plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.tight_layout()
        plt.show()