## Python Version

In [37]:
!python --version

Python 3.6.4 :: Anaconda custom (64-bit)


## Mission
In this mission, we learned how to transform columns, normalize columns, and use the arithmetic operators to create new columns.

## Import data
- Import the pandas library.
- Read food_info.csv into a DataFrame object named food_info.
- Use the DataFrame.columns attribute, followed by the Index.tolist() method, to return a list containing only the column names.
- Assign the resulting list to col_names, and use the print() function to display the value.
- Display the first three rows of food_info.
<br><br>
read_csv takes an encoding option to handle files saved in different character encodings. I mostly use read_csv('file', encoding="ISO-8859-1") or encoding="utf8" when reading, and generally UTF-8 when writing with to_csv.

You can also use the alias 'latin1' instead of 'ISO-8859-1'.
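
To illustrate why the encoding matters, here is a minimal sketch that writes a tiny Latin-1 file and reads it back. The file name and its contents are made up for the example and are not part of food_info.csv; only the µ character in the column name mirrors the real data.

```python
import os
import tempfile

import pandas as pd

# Write a tiny Latin-1 encoded CSV so the example is self-contained.
# The file name and contents are hypothetical.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample_latin1.csv")
with open(csv_path, "w", encoding="latin1") as f:
    f.write("Shrt_Desc,Selenium_(µg)\nBUTTER,1.0\n")

# Reading with the matching encoding decodes the "µ" correctly.
sample = pd.read_csv(csv_path, encoding="latin1")
sample_cols = sample.columns.tolist()

# Reading the same bytes as UTF-8 fails, because the Latin-1 byte
# for "µ" (0xB5) is not valid UTF-8 on its own.
try:
    pd.read_csv(csv_path, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(sample_cols, utf8_failed)
```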

In [13]:
import pandas as pd
food_info = pd.read_csv("food_info.csv", encoding = "latin1")
col_names = food_info.columns.tolist()
print(col_names)
food_info.head(3)

['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_mg)', 'Manganese_(mg)', 'Selenium_(µg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Panto_Acid_mg)', 'Vit_B6_(mg)', 'Folate_Tot_(µg)', 'Folic_Acid_(µg)', 'Food_Folate_(µg)', 'Folate_DFE_(µg)', 'Choline_Tot_ (mg)', 'Vit_B12_(µg)', 'Vit_A_IU', 'Vit_A_RAE', 'Retinol_(µg)', 'Alpha_Carot_(µg)', 'Beta_Carot_(µg)', 'Beta_Crypt_(µg)', 'Lycopene_(µg)', 'Lut+Zea_ (µg)', 'Vit_E_(mg)', 'Vit_D_µg', 'Vit_D_IU', 'Vit_K_(µg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)', 'GmWt_1', 'GmWt_Desc1', 'GmWt_2', 'GmWt_Desc2', 'Refuse_Pct']


Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_K_(µg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct
0,1001,"BUTTER,WITH SALT",15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,7.0,51.368,21.021,3.043,215.0,5.0,"1 pat, (1"" sq, 1/3"" high)",14.2,1 tbsp,0.0
1,1002,"BUTTER,WHIPPED,W/ SALT",16.72,718,0.49,78.3,1.62,2.87,0.0,0.06,...,4.6,45.39,19.874,3.331,225.0,3.8,"1 pat, (1"" sq, 1/3"" high)",9.4,1 tbsp,0.0
2,1003,"BUTTER OIL,ANHYDROUS",0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,8.6,61.924,28.732,3.694,256.0,12.8,1 tbsp,205.0,1 cup,0.0


## Arithmetic operations
**Instruction:**
- Divide the "Sodium_(mg)" column by 1000 to convert the values to grams, and assign the result to sodium_grams.
- Multiply the "Sugar_Tot_(g)" column by 1000 to convert to milligrams, and assign the result to sugar_milligrams.

In [14]:
sodium_grams = food_info["Sodium_(mg)"] / 1000
sugar_milligrams = food_info["Sugar_Tot_(g)"] * 1000
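
As a small illustration of what happens in the cell above, the sketch below applies a scalar to a toy Series. The values are invented stand-ins for the "Sodium_(mg)" column, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the "Sodium_(mg)" column; the values are made up.
sodium_mg = pd.Series([714.0, 827.0, 2.0], name="Sodium_(mg)")

# The scalar is applied to every element. The result is a new Series
# with the same index; the original Series is left unchanged.
sodium_g = sodium_mg / 1000

print(sodium_g.tolist())
print(sodium_mg.tolist())
```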

**Instruction:**
- Assign the number of grams of protein per gram of water ("Protein_(g)" column divided by "Water_(g)" column) to grams_of_protein_per_gram_of_water.
- Assign the total amount of calcium and iron ("Calcium_(mg)" column plus "Iron_(mg)" column) to milligrams_of_calcium_and_iron.


In [15]:
grams_of_protein_per_gram_of_water = food_info["Protein_(g)"] / food_info["Water_(g)"]
milligrams_of_calcium_and_iron = food_info["Calcium_(mg)"] + food_info["Iron_(mg)"]
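
One thing to watch when dividing two columns: a zero in the denominator does not raise an error. The sketch below uses invented values to show the behaviour; replacing infinities with NaN is only one possible way to handle them, and whether it is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy values, not taken from food_info.csv.
protein = pd.Series([0.85, 0.0, 12.5])
water = pd.Series([15.87, 0.0, 0.0])

# Element-wise division, matched on the index labels.
ratio = protein / water
# Row 0: ordinary quotient. Row 1: 0/0 gives NaN. Row 2: 12.5/0 gives inf.

# One option (an assumption about what is wanted downstream):
# treat infinite ratios as missing values as well.
ratio_clean = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio.tolist())
print(ratio_clean.isna().sum())
```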

**Instruction:**
- Multiply the "Protein_(g)" column by two, and assign the resulting Series to weighted_protein.
- Multiply the "Lipid_Tot_(g)" column by -0.75, and assign the resulting Series to weighted_fat.
- Add both Series objects together and assign the result to initial_rating.

In [16]:
weighted_protein = food_info["Protein_(g)"] * 2 
weighted_fat = food_info["Lipid_Tot_(g)"] * -0.75
initial_rating = weighted_protein + weighted_fat

## Normalize values
The normalized value of x is given by:
\begin{equation*}
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
\end{equation*}

**Instruction:**
- Normalize the values in the "Protein_(g)" column, and assign the result to normalized_protein.
- Normalize the values in the "Lipid_Tot_(g)" column, and assign the result to normalized_fat.

In [18]:
x = food_info["Protein_(g)"]
x_min = x.min()
normalized_protein = (x - x_min)/(x.max()- x_min)

x = food_info["Lipid_Tot_(g)"]
x_min = x.min()
normalized_fat = (x - x_min)/(x.max()- x_min)
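
Since the same three lines are repeated for each column, the formula could also be wrapped in a small helper. The function name min_max_normalize is hypothetical and the data is a toy Series; this is a sketch, not part of the original exercise.

```python
import pandas as pd


def min_max_normalize(column):
    """Rescale a numeric Series to [0, 1] using (x - min) / (max - min)."""
    col_min = column.min()
    col_range = column.max() - col_min
    return (column - col_min) / col_range


# Toy values chosen so the result is easy to check by hand.
toy = pd.Series([0.0, 25.0, 100.0])
normalized_toy = min_max_normalize(toy)

print(normalized_toy.tolist())  # [0.0, 0.25, 1.0]
```

Note that a column whose values are all identical has a range of zero, so this helper would return NaN for every row in that case.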

## Add columns to DataFrame
**Instruction:**
- Assign the normalized "Protein_(g)" column to a new column named "Normalized_Protein" in food_info.
- Assign the normalized "Lipid_Tot_(g)" column to a new column named "Normalized_Fat" in food_info.


In [19]:
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat

In [20]:
food_info.head(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct,Normalized_Protein,Normalized_Fat
0,1001,"BUTTER,WITH SALT",15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,21.021,3.043,215.0,5.0,"1 pat, (1"" sq, 1/3"" high)",14.2,1 tbsp,0.0,0.009624,0.8111
1,1002,"BUTTER,WHIPPED,W/ SALT",16.72,718,0.49,78.3,1.62,2.87,0.0,0.06,...,19.874,3.331,225.0,3.8,"1 pat, (1"" sq, 1/3"" high)",9.4,1 tbsp,0.0,0.005548,0.783
2,1003,"BUTTER OIL,ANHYDROUS",0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,28.732,3.694,256.0,12.8,1 tbsp,205.0,1 cup,0.0,0.00317,0.9948


## Manipulate data and assign a new column to DataFrame
The normalized nutrient index is given by:
\begin{equation*}
\text{Normalized Nutrient Index} = 2 \times \text{Normalized Protein} - 0.75 \times \text{Normalized Fat}
\end{equation*}
- Use the Normalized_Protein and Normalized_Fat columns with the formula above to create the Norm_Nutr_Index column.

In [22]:
# Build the index from the columns added to food_info above.
food_info["Norm_Nutr_Index"] = 2 * food_info["Normalized_Protein"] - 0.75 * food_info["Normalized_Fat"]

In [23]:
food_info.head(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct,Normalized_Protein,Normalized_Fat,Norm_Nutr_Index
0,1001,"BUTTER,WITH SALT",15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,3.043,215.0,5.0,"1 pat, (1"" sq, 1/3"" high)",14.2,1 tbsp,0.0,0.009624,0.8111,-0.589077
1,1002,"BUTTER,WHIPPED,W/ SALT",16.72,718,0.49,78.3,1.62,2.87,0.0,0.06,...,3.331,225.0,3.8,"1 pat, (1"" sq, 1/3"" high)",9.4,1 tbsp,0.0,0.005548,0.783,-0.576154
2,1003,"BUTTER OIL,ANHYDROUS",0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,3.694,256.0,12.8,1 tbsp,205.0,1 cup,0.0,0.00317,0.9948,-0.739759
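
As a quick check, the Norm_Nutr_Index of the first row (BUTTER,WITH SALT) can be reproduced by hand from the Normalized_Protein and Normalized_Fat values displayed above:
\begin{equation*}
2 \times 0.009624 - 0.75 \times 0.8111 = 0.019248 - 0.608325 = -0.589077
\end{equation*}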


## Sort
The DataFrame currently appears in numerical order according to the NDB_No column. NDB_No is a unique USDA identifier that isn't really useful for our needs. To explore which foods rank the highest in the Norm_Nutr_Index column, we need to sort the DataFrame by that column. DataFrame objects have a `sort_values()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) that we can use to sort the entire DataFrame.

To sort the DataFrame on the Sodium_(mg) column, pass the column name to the DataFrame.sort_values() method. The call returns a new, sorted DataFrame; here we display its first three rows:


In [25]:
food_info.sort_values("Sodium_(mg)").head(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct,Normalized_Protein,Normalized_Fat,Norm_Nutr_Index
784,4667,"SHORTENING,INDUS,SOY (PART HYDR ) FOR BAKING &...",0.0,884,0.0,100.0,0.0,0.0,0.0,0.0,...,5.4,0.0,12.8,1 tbsp,205.0,1 cup,0.0,0.0,1.0,-0.75
681,4513,"VEGETABLE OIL,PALM KERNEL",0.0,862,0.0,100.0,0.0,0.0,0.0,0.0,...,1.6,0.0,13.6,1 tablespoon,218.0,1 cup,0.0,0.0,1.0,-0.75
680,4511,"OIL,SAFFLOWER,SALAD OR COOKING,HI OLEIC",0.0,884,0.0,100.0,0.0,0.0,0.0,0.0,...,12.82,0.0,13.6,1 tablespoon,218.0,1 cup,0.0,0.0,1.0,-0.75


In [27]:
# Equivalent: pass the column name(s) as a list through the by parameter
food_info.sort_values(by=["Sodium_(mg)"]).head(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct,Normalized_Protein,Normalized_Fat,Norm_Nutr_Index
784,4667,"SHORTENING,INDUS,SOY (PART HYDR ) FOR BAKING &...",0.0,884,0.0,100.0,0.0,0.0,0.0,0.0,...,5.4,0.0,12.8,1 tbsp,205.0,1 cup,0.0,0.0,1.0,-0.75
681,4513,"VEGETABLE OIL,PALM KERNEL",0.0,862,0.0,100.0,0.0,0.0,0.0,0.0,...,1.6,0.0,13.6,1 tablespoon,218.0,1 cup,0.0,0.0,1.0,-0.75
680,4511,"OIL,SAFFLOWER,SALAD OR COOKING,HI OLEIC",0.0,884,0.0,100.0,0.0,0.0,0.0,0.0,...,12.82,0.0,13.6,1 tablespoon,218.0,1 cup,0.0,0.0,1.0,-0.75


In [30]:
# Sort by multiple columns (ascending); returns a new DataFrame
food_info.sort_values(by=["Protein_(g)", "Lipid_Tot_(g)"]).tail(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct,Normalized_Protein,Normalized_Fat,Norm_Nutr_Index
6204,19177,"GELATINS,DRY PDR,UNSWTND",13.0,335,85.6,0.1,1.3,0.0,0.0,0.0,...,0.01,0.0,7.0,"1 envelope, (1 tbsp)",28.0,"1 package, (1 oz)",0.0,0.969203,0.001,1.937656
5009,16422,"SOY PROT ISOLATE,K TYPE",4.98,321,88.32,0.53,3.58,2.59,0.0,0.0,...,0.299,0.0,28.35,1 oz,,,0.0,1.0,0.0053,1.996025
4858,16122,SOY PROTEIN ISOLATE,4.98,335,88.32,3.39,3.58,0.0,0.0,0.0,...,1.648,0.0,28.35,1 oz,,,0.0,1.0,0.0339,1.974575


In [36]:
# Sort by multiple columns in descending order; returns a new DataFrame
food_info.sort_values(by=["Protein_(g)", "Lipid_Tot_(g)"], ascending=False).head(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct,Normalized_Protein,Normalized_Fat,Norm_Nutr_Index
4858,16122,SOY PROTEIN ISOLATE,4.98,335,88.32,3.39,3.58,0.0,0.0,0.0,...,1.648,0.0,28.35,1 oz,,,0.0,1.0,0.0339,1.974575
5009,16422,"SOY PROT ISOLATE,K TYPE",4.98,321,88.32,0.53,3.58,2.59,0.0,0.0,...,0.299,0.0,28.35,1 oz,,,0.0,1.0,0.0053,1.996025
6204,19177,"GELATINS,DRY PDR,UNSWTND",13.0,335,85.6,0.1,1.3,0.0,0.0,0.0,...,0.01,0.0,7.0,"1 envelope, (1 tbsp)",28.0,"1 package, (1 oz)",0.0,0.969203,0.001,1.937656


By default, pandas will sort the data by the column we specify in ascending order and return a new DataFrame, rather than modifying food_info itself. To customize the method's behavior, use the parameters listed in the documentation:

In [33]:
# Sorts the DataFrame in-place, rather than returning a new DataFrame.
food_info.sort_values("Sodium_(mg)", inplace=True)

# Sorts by descending order, rather than ascending.
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
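
Two more sort_values() parameters are worth a quick sketch: ascending also accepts one flag per column, and na_position controls where missing values land. The frame below is a toy example with invented rows; reassigning the result to a new name is shown as an alternative to inplace=True.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Shrt_Desc": ["A", "B", "C", "D"],
    "Protein_(g)": [88.32, 88.32, 85.6, None],
    "Lipid_Tot_(g)": [3.39, 0.53, 0.1, 1.0],
})

# Protein descending, ties broken by fat ascending. Rows with a
# missing protein value are placed first instead of last.
ranked = toy.sort_values(
    by=["Protein_(g)", "Lipid_Tot_(g)"],
    ascending=[False, True],
    na_position="first",
)

print(ranked["Shrt_Desc"].tolist())  # ['D', 'B', 'A', 'C']
print(toy["Shrt_Desc"].tolist())     # original order is untouched
```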

**Instruction:**
- Sort the food_info DataFrame in-place on the Norm_Nutr_Index column in descending order.

In [35]:
food_info.sort_values(by=["Norm_Nutr_Index"], inplace=True, ascending=False)
food_info.head(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,FA_Poly_(g),Cholestrl_(mg),GmWt_1,GmWt_Desc1,GmWt_2,GmWt_Desc2,Refuse_Pct,Normalized_Protein,Normalized_Fat,Norm_Nutr_Index
5009,16422,"SOY PROT ISOLATE,K TYPE",4.98,321,88.32,0.53,3.58,2.59,0.0,0.0,...,0.299,0.0,28.35,1 oz,,,0.0,1.0,0.0053,1.996025
4858,16122,SOY PROTEIN ISOLATE,4.98,335,88.32,3.39,3.58,0.0,0.0,0.0,...,1.648,0.0,28.35,1 oz,,,0.0,1.0,0.0339,1.974575
6204,19177,"GELATINS,DRY PDR,UNSWTND",13.0,335,85.6,0.1,1.3,0.0,0.0,0.0,...,0.01,0.0,7.0,"1 envelope, (1 tbsp)",28.0,"1 package, (1 oz)",0.0,0.969203,0.001,1.937656
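
Notice in the output above that the rows keep their original index labels (5009, 4858, 6204) after sorting. The sketch below, using a toy frame with invented index labels and values, shows the difference between label-based and position-based access after a sort, and how reset_index(drop=True) renumbers the rows if that is preferred.

```python
import pandas as pd

# Toy frame with made-up values and non-consecutive index labels.
toy = pd.DataFrame(
    {"Norm_Nutr_Index": [-0.59, 1.99, 1.94]},
    index=[0, 5009, 6204],
)

ranked = toy.sort_values("Norm_Nutr_Index", ascending=False)

# Index labels travel with their rows.
print(ranked.index.tolist())  # [5009, 6204, 0]

# .iloc selects by position, .loc selects by label.
top_by_position = ranked.iloc[0]["Norm_Nutr_Index"]
row_with_label_0 = ranked.loc[0, "Norm_Nutr_Index"]

# Renumber the rows 0..n-1 in the new order, discarding the old labels.
ranked_reset = ranked.reset_index(drop=True)
print(ranked_reset.index.tolist())  # [0, 1, 2]
```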
