# Food Files

## Introduction
In this exercise, you will combine skills from the File I/O, Lists, and Dictionary lessons in a practical example. Several of the files contain nutritional data created from a column-oriented database.

Each entry in a file is linked by its position (line number) to the entry at the same position in the foods file. The foods column served as the key, but something went wrong during the data transfer and introduced errors. Before coding each part, be sure to pseudocode or create a flowchart for the needed steps.
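To make the positional link concrete, here is a minimal sketch using two short made-up lists (not the exercise files). The element at index `i` in one list describes the food at index `i` in the other.

```python
# Hypothetical miniature columns; index 0 holds each file's header.
foods = ['foods', 'Donut', 'Carrot', 'Strawberry']
high_fiber = ['high fiber', 'no', 'yes', 'yes']

# zip pairs elements that share a position; the slices skip the header row.
pairs = list(zip(foods[1:], high_fiber[1:]))
print(pairs)
# [('Donut', 'no'), ('Carrot', 'yes'), ('Strawberry', 'yes')]
```

Deleting an element from only one list would shift every later pair, which is why rows must always be removed from all lists together.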

## Part 1: Read in the Data
Read in the provided data from multiple files using one of the tools explained in the File I/O lesson and place these into Lists. Look at the files before continuing.

To facilitate reading in multiple data files, we will create a function `DataReadIn()` to read in specific files.

**In the function:**
* Open the file in read mode
* Load the data into a local variable
* Close the file
* Return the loaded data

In [1]:
#Data read in function

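def DataReadIn(filename):
    '''
        Takes the txt file located at filename (given without the .txt
        extension) and returns its lines as a list of strings
    '''
    # Open in read mode; the with block closes the file automatically
    with open(filename + '.txt', 'r') as f:
        raw_data = f.read()
    # Split on line breaks so each line becomes one list element
    data = raw_data.splitlines()
    return data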

In [2]:
food_data = DataReadIn('foods')
highfiber_data = DataReadIn('highfiber')
lowfat_data = DataReadIn('lowfat')
low_glycemic_index_data = DataReadIn('low-glycemic-index')

Let's take a look at the data:

In [3]:
food_data

['foods',
 'Donut',
 'Carrot',
 'Strawberry',
 'Doritos: Cool Ranch',
 'Pasta',
 'Blue Berry Muffin',
 'Strawberry Smoothie',
 'Chocolate Milk',
 'Protein Bar',
 'Orange Juice',
 'Greek Salad',
 '',
 'Takis',
 'Popcorn',
 'salmon2',
 'pizza rolls',
 'canteloupe',
 'potatoes',
 'watermelon',
 'oatmeal',
 'Slim Jims',
 'Brussel Sprouts',
 'Lasagna',
 'Fried Chicken',
 'Pizza Rolls',
 'Bacon',
 'French Fries',
 'Skim Milk',
 'Green Beans',
 'ja&ng',
 'Doritos: Nacho Cheese',
 'Hot Sauce',
 'Sriacha']

In [4]:
highfiber_data

['high fiber',
 'no',
 'yes ',
 'yes ',
 'nO',
 'no',
 'no',
 'no',
 'no',
 'yes ',
 'no',
 'no',
 '',
 'no',
 'yes ',
 'yes  ',
 'no',
 'yes  ',
 'no',
 'yes  ',
 'yes',
 'yEs',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes  ',
 '23314',
 'no',
 'No',
 'yes  ']

In [5]:
lowfat_data

['low fat',
 'No',
 'Yes',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'No',
 'yes',
 'no',
 'no',
 '',
 'no',
 'no',
 'yes',
 'no',
 'yes',
 'yes ',
 'yes ',
 'yes',
 'no',
 'yes ',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'yes ',
 'sa2e ',
 'no',
 'yeS ',
 'yes ']

## Part 2: Clean Your Data
The data in the files have one or more of the following issues:
* Duplicate data
* Missing data
* Corrupted data

Please do not change the data files. Read the files into Python data structures and then clean them using Python. You may not use pandas for this exercise. You may delete rows that are missing data, but you must delete the same position from each of your four lists. Create a list of dictionaries from these lists.

Inside each dictionary, create one key-value pair per file. The key is the first row of the file (which is also its header), and the value is the entry at the corresponding line position in that file.
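One possible shape for this list of dictionaries is sketched below. The helper name `build_records` and the two short sample lists are hypothetical; the sketch assumes every list has already been cleaned and has the same length, with the header at index 0.

```python
def build_records(*columns):
    """Hypothetical helper: turn equal-length columns (header first) into a list of dicts."""
    # The first element of each column is its header and becomes a dictionary key.
    headers = [col[0] for col in columns]
    # zip(*...) regroups the remaining elements row by row.
    rows = zip(*(col[1:] for col in columns))
    return [dict(zip(headers, row)) for row in rows]

foods = ['foods', 'Donut', 'Carrot']
high_fiber = ['high fiber', 'no', 'yes']

records = build_records(foods, high_fiber)
print(records)
# [{'foods': 'Donut', 'high fiber': 'no'}, {'foods': 'Carrot', 'high fiber': 'yes'}]
```

Each dictionary then describes one food, so a lookup such as `records[0]['high fiber']` reads naturally.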

While scanning the lists above, you may have noticed that:
* `food_data` has missing data, duplicates, inconsistent casing, and some corrupted data
* `highfiber_data` has missing data, inconsistent casing, some corrupted data, and extra whitespace in strings (and duplicates, judging from `food_data`)
* `lowfat_data` has missing data, inconsistent casing, some corrupted data, and extra whitespace in strings (and duplicates, judging from `food_data`)
* `low_glycemic_index_data` has missing data, inconsistent casing, and corrupted data

We will set up a workflow to clean the data, starting with the issues common to all the lists:

1. Whitespace: Define a function that takes a list and returns it with leading and trailing whitespace stripped from each element.

2. Character casing: Define a function that takes a list and a specified casing, and returns the list with each element converted to that casing.

3. Missing data: Create a function that takes in a list, finds where data is missing, and removes that position from the given list and from the associated lists (e.g. it takes in `food_data`, finds missing data at position i, and removes the element at position i from `food_data`, `highfiber_data`, `lowfat_data`, and `low_glycemic_index_data`).

4. Corrupted data: Create a function that takes in a list, finds where data is corrupted, and removes that position from the given list and from the associated lists (e.g. it takes in `food_data`, finds corrupted data at position j, and removes the element at position j from `food_data`, `highfiber_data`, `lowfat_data`, and `low_glycemic_index_data`).

5. Duplicate data: Create a function that finds duplicate entries in a list by counting how often each element appears, then removes the entries that appear more than once along with the associated positions in the other lists.

**Note:** Steps 3 and 4 can be combined into a single function.
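A minimal sketch of step 5 follows. It assumes that the first occurrence of each duplicated entry is kept and only the later repeats are removed; the exercise text does not say which copy to keep. `remove_duplicates` and the sample lists are hypothetical names used for illustration.

```python
from collections import Counter

def remove_duplicates(key_list, *other_lists):
    """Hypothetical helper: drop repeated rows in place, keeping the first occurrence."""
    counts = Counter(key_list[1:])   # how often each entry appears (header skipped)
    seen = set()
    to_delete = []
    for i, item in enumerate(key_list[1:], start=1):
        # A row is a repeat if its entry occurs more than once and was seen earlier.
        if counts[item] > 1 and item in seen:
            to_delete.append(i)
        seen.add(item)
    # Delete from the highest index down so earlier indices stay valid.
    for i in reversed(to_delete):
        for lst in (key_list, *other_lists):
            del lst[i]
    return to_delete

foods = ['foods', 'Pizza Rolls', 'Bacon', 'Pizza Rolls']
low_fat = ['low fat', 'no', 'no', 'no']

removed = remove_duplicates(foods, low_fat)
print(removed)   # [3]
print(foods)     # ['foods', 'Pizza Rolls', 'Bacon']
print(low_fat)   # ['low fat', 'no', 'no']
```

Because duplicates are detected by exact string comparison, this step only works reliably after whitespace and casing have been normalized.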

In [7]:
#Whitespace removal function

def RemoveWhitespace(data_list):
    for i, item in enumerate(data_list[1:]):
        data_list[i+1] = item.strip()
    return data_list

In [8]:
ws_food_data = RemoveWhitespace(food_data)
ws_highfiber_data = RemoveWhitespace(highfiber_data)
ws_lowfat_data = RemoveWhitespace(lowfat_data)
ws_low_glycemic_index_data = RemoveWhitespace(low_glycemic_index_data)
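Note that `RemoveWhitespace` modifies the list it receives and returns that same list, so names such as `ws_food_data` and `food_data` refer to one object. The sketch below illustrates the difference between in-place cleaning and building a new list; `strip_in_place` and the sample lists are hypothetical stand-ins.

```python
def strip_in_place(data_list):
    # Same idea as RemoveWhitespace: overwrite elements of the list passed in.
    for i, item in enumerate(data_list):
        data_list[i] = item.strip()
    return data_list

original = ['high fiber', 'yes ', 'no']
cleaned = strip_in_place(original)
print(cleaned is original)   # True: both names refer to one list object
print(original)              # ['high fiber', 'yes', 'no']

# A list comprehension builds a new list and leaves its input untouched.
untouched = ['high fiber', 'yes ', 'no']
copy_cleaned = [item.strip() for item in untouched]
print(untouched[1] == 'yes ')   # True: the input still has the trailing space
```

In-place cleaning is fine for this exercise, as long as you remember that the raw lists no longer exist after the cleaning cells have run.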

In [9]:
ws_highfiber_data

['high fiber',
 'no',
 'yes',
 'yes',
 'nO',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 '',
 'no',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'yes',
 'yes',
 'yEs',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes',
 '23314',
 'no',
 'No',
 'yes']

In [10]:
#Character casing function

def CharacterCasing(data_list, casing='title'):
    '''
        data_list: list
        casing: str
        
        takes in data_list and applies element-wise casing; current options are title and lower
    '''
    
    if casing == 'title':
        for i, item in enumerate(data_list[1:]):
            data_list[i+1] = item.title()
    elif casing == 'lower':
        for i, item in enumerate(data_list[1:]):
            data_list[i+1] = item.lower()
    return data_list

In [11]:
c_food_data = CharacterCasing(ws_food_data, casing='title')
c_highfiber_data = CharacterCasing(ws_highfiber_data, casing='lower')
c_lowfat_data = CharacterCasing(ws_lowfat_data, casing='lower')
c_low_glycemic_index_data = CharacterCasing(ws_low_glycemic_index_data, casing='lower')
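As a quick illustration of what `str.title()` does to entries like those in `food_data`, the sketch below applies it to a few sample strings. Title casing makes differently cased copies identical, which helps the later duplicate check, but it does not repair corrupted entries.

```python
samples = ['pizza rolls', 'Pizza Rolls', 'salmon2', 'ja&ng']
titled = [s.title() for s in samples]
print(titled)
# ['Pizza Rolls', 'Pizza Rolls', 'Salmon2', 'Ja&Ng']

# The two spellings of the same food now compare equal.
print(titled[0] == titled[1])   # True
```

`title()` capitalizes the first letter after any non-letter character, which is why `'ja&ng'` becomes `'Ja&Ng'`.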

In [12]:
c_highfiber_data

['high fiber',
 'no',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 '',
 'no',
 'yes',
 'yes',
 'no',
 'yes',
 'no',
 'yes',
 'yes',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes',
 '23314',
 'no',
 'no',
 'yes']

In [13]:
#Missing/Corrupted data cleaning function

def RemoveMissCorrData(data_list, list1=None, list2=None, list3=None):
    '''
        data_list: list
        list1, list2, list3: list or None
        
        Finds every index (header excluded) where the element of data_list is
        not purely alphabetic; this covers empty strings and entries with digits.
        Deletes those indices in place from data_list and from each associated
        list that is not None.
    '''
    
    indices = []
    for i, item in enumerate(data_list[1:]):
        if not item.isalpha():
            print(data_list[i+1])
            indices.append(i+1)
            
    # Delete from the highest index down so earlier deletions do not shift later ones
    indices.reverse()
    for j in indices:
        print(data_list[j])
        for other in (list3, list2, list1):
            if other is not None:
                del other[j]
        del data_list[j]
        

In [14]:
RemoveMissCorrData(c_highfiber_data, list1=c_food_data, list2=c_lowfat_data, list3=c_low_glycemic_index_data)


23314
23314



In [15]:
print(len(c_food_data))
print(len(c_highfiber_data))
print(len(c_lowfat_data))
print(len(c_low_glycemic_index_data))

32
32
32
32
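The `isalpha()` test works for the yes/no columns, but it would also flag valid food names that contain spaces or punctuation, and it would accept alphabetic garbage. An alternative sketch for the flag columns is to check each value against a set of allowed values. `ALLOWED`, `invalid_positions`, and the sample list are hypothetical, and the sketch assumes `'yes'` and `'no'` are the only valid values after whitespace and casing have been cleaned.

```python
ALLOWED = {'yes', 'no'}   # assumption: the only valid flag values after cleaning

def invalid_positions(flag_list, allowed=ALLOWED):
    """Hypothetical helper: indices (header skipped) whose value is not allowed."""
    return [i for i, item in enumerate(flag_list[1:], start=1)
            if item not in allowed]

low_fat = ['low fat', 'no', '', 'yes', 'sa2e']
print(invalid_positions(low_fat))        # [2, 4]

# isalpha() rejects a legitimate food name because of the colon and spaces.
print('Doritos: Cool Ranch'.isalpha())   # False
```

The returned indices could then be deleted from every list, highest index first, in the same way `RemoveMissCorrData` does.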


In [16]:
c_low_glycemic_index_data

['low glycemic index',
 'no',
 'yes',
 'yes',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'no',
 'no',
 'no',
 'no',
 'no',
 'yes',
 'no',
 'yes',
 'yes']

In [17]:
#Dictionary

