# Pandas 


Pandas is an open-source Python library. It provides data analysis tools and a high-performance data structure, called a dataframe, for representing tabular data (data organised in rows and columns).

Like a NumPy 2D array, a Pandas dataframe is a 2-dimensional data structure. It allows us:
* To load data from different file formats: `.txt`, `.csv`, `.xml`, `.html`, `.xls`, etc.
* To prepare, explore, perform operations on, visualise, and analyse data.
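
As a minimal sketch of the loading step, the example below reads CSV text into a dataframe. The in-memory string and its column names are hypothetical stand-ins for a file on disk, used only to keep the example self-contained.

```python
import io

import pandas

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,water_g\nbutter,15.87\ncheese,42.41\n"

# read_csv accepts a file path or any file-like object.
small_df = pandas.read_csv(io.StringIO(csv_text))

print(small_df)
print(small_df.dtypes)  # the column types are inferred from the content
```

Other formats have their own readers, for example `pandas.read_excel` for Excel files.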

Advantages of a Pandas dataframe over a NumPy 2D array:
* It can store mixed data types: each column can have its own type.
* We can refer to the elements of a dataframe by using labels for rows and columns. Dataframes preserve the metadata that surrounds the data in the file (the column titles serve as labels).
* It can easily handle missing values, which are represented as `NaN`.
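
These three points can be sketched on a toy dataframe. The data, the column names, and the row labels below are hypothetical and are not taken from the dataset used later.

```python
import numpy as np
import pandas

# Hypothetical toy data: one text column and one numeric column with a missing value.
toy = pandas.DataFrame(
    {"food": ["butter", "cheese", "snail"], "energy_kcal": [717.0, np.nan, 90.0]},
    index=["a", "b", "c"],  # custom row labels
)

print(toy.dtypes)                   # each column keeps its own data type
print(toy.loc["c", "energy_kcal"])  # access by row label and column label
print(toy["energy_kcal"].isna())    # locate the missing value
print(toy["energy_kcal"].mean())    # NaN is skipped by default
```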
 


To use the Pandas library, we first need to import it.
We can then load a file into a dataframe. The dataset used here is release SR27 of the USDA National Nutrient Database for Standard Reference (Excel format exported as `csv`): https://www.ars.usda.gov/northeast-area/beltsville-md/beltsville-human-nutrition-research-center/nutrient-data-laboratory/docs/usda-national-nutrient-database-for-standard-reference/

In [2]:
import pandas

food_info = pandas.read_csv('data/food_info.csv')
first_rows = food_info.head()
print(first_rows)

   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
0    1001          BUTTER,WITH SALT      15.87         717         0.85   
1    1002  BUTTER,WHIPPED,WITH SALT      15.87         717         0.85   
2    1003      BUTTER OIL,ANHYDROUS       0.24         876         0.28   
3    1004               CHEESE,BLUE      42.41         353        21.40   
4    1005              CHEESE,BRICK      41.11         371        23.24   

   Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
0          81.11     2.11            0.06           0.0           0.06   
1          81.11     2.11            0.06           0.0           0.06   
2          99.48     0.00            0.00           0.0           0.00   
3          28.74     5.11            2.34           0.0           0.50   
4          29.68     3.18            2.79           0.0           0.51   

      ...      Vit_K_(µg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  \
0     ...             7.0      51.368

The `shape` attribute gives the dimensions of the dataframe as a `(rows, columns)` tuple:

In [3]:
dimensions = food_info.shape
print(dimensions)

(8618, 53)


In [4]:
# The number of rows, 8618.
num_rows = dimensions[0]
# The number of columns, 53.
num_cols = dimensions[1]

To extract the first three rows using label-based slicing with `loc` (unlike Python list slicing, the end label is included):

In [5]:
print(food_info.loc[0:2])

   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
0    1001          BUTTER,WITH SALT      15.87         717         0.85   
1    1002  BUTTER,WHIPPED,WITH SALT      15.87         717         0.85   
2    1003      BUTTER OIL,ANHYDROUS       0.24         876         0.28   

   Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
0          81.11     2.11            0.06           0.0           0.06   
1          81.11     2.11            0.06           0.0           0.06   
2          99.48     0.00            0.00           0.0           0.00   

      ...      Vit_K_(µg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  \
0     ...             7.0      51.368       21.021        3.043   
1     ...             7.0      50.489       23.426        3.012   
2     ...             8.6      61.924       28.732        3.694   

   Cholestrl_(mg)  GmWt_1                  GmWt_Desc1  GmWt_2  GmWt_Desc2  \
0           215.0     5.0  1 pat,  (1" sq, 1/3" high)   
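
The difference between label-based and position-based slicing can be sketched on a small dataframe. The toy data below is hypothetical; `iloc` is the position-based counterpart of `loc`.

```python
import pandas

# Hypothetical toy data with the default integer index 0..3.
toy = pandas.DataFrame({"value": [10, 20, 30, 40]})

by_label = toy.loc[0:2]      # labels 0, 1 and 2: the end label is included
by_position = toy.iloc[0:3]  # positions 0, 1 and 2: the end position is excluded

print(by_label.equals(by_position))
```

With the default integer index the two selections coincide. They differ as soon as the index labels are not the positions, for example after sorting the dataframe.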

When accessing an individual row, pandas returns a Series object containing the column names and the corresponding values for this row. 

In [6]:
print(food_info.loc[0])

NDB_No                                     1001
Shrt_Desc                      BUTTER,WITH SALT
Water_(g)                                 15.87
Energ_Kcal                                  717
Protein_(g)                                0.85
Lipid_Tot_(g)                             81.11
Ash_(g)                                    2.11
Carbohydrt_(g)                             0.06
Fiber_TD_(g)                                  0
Sugar_Tot_(g)                              0.06
Calcium_(mg)                                 24
Iron_(mg)                                  0.02
Magnesium_(mg)                                2
Phosphorus_(mg)                              24
Potassium_(mg)                               24
Sodium_(mg)                                 643
Zinc_(mg)                                  0.09
Copper_mg)                                    0
Manganese_(mg)                                0
Selenium_(µg)                                 1
Vit_C_(mg)                              
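
A short sketch of what the returned Series looks like: the column names of the dataframe become the index of the Series, so a single value can be read by column label. The two-row dataframe below is a hypothetical excerpt that reuses two column names and values shown above.

```python
import pandas

# Hypothetical two-row excerpt with two of the columns shown above.
toy = pandas.DataFrame(
    {"Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BLUE"], "Energ_Kcal": [717, 353]}
)

row = toy.loc[0]
print(type(row))           # a pandas Series
print(row.index.tolist())  # the column names of the dataframe
print(row["Energ_Kcal"])   # a single value, accessed by column label
```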

To extract the specific rows with index labels 1, 3, and 10, pass a list of labels to `loc`:

In [7]:
food_info.loc[[1, 3, 10]]

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_K_(µg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | GmWt_1 | GmWt_Desc1 | GmWt_2 | GmWt_Desc2 | Refuse_Pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1002 | BUTTER,WHIPPED,WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 7.0 | 50.489 | 23.426 | 3.012 | 219.0 | 3.0 | 1 pat, (1" sq, 1/3" high) | 9.4 | 1 tbsp | 0.0 |
| 3 | 1004 | CHEESE,BLUE | 42.41 | 353 | 21.4 | 28.74 | 5.11 | 2.34 | 0.0 | 0.5 | ... | 2.4 | 18.669 | 7.778 | 0.8 | 75.0 | 28.0 | 1 oz | 17.0 | 1 cubic inch | 0.0 |
| 10 | 1011 | CHEESE,COLBY | 38.2 | 394 | 23.76 | 32.11 | 3.36 | 2.57 | 0.0 | 0.52 | ... | 2.7 | 20.218 | 9.28 | 0.953 | 95.0 | 132.0 | 1 cup, diced | 113.0 | 1 cup, shredded | 0.0 |


To select the last 3 rows of `food_info` (the end label of a `loc` slice is included, so the last label needed is `num_rows - 1`):

In [8]:
food_info.loc[num_rows - 3:num_rows - 1]

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_K_(µg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | GmWt_1 | GmWt_Desc1 | GmWt_2 | GmWt_Desc2 | Refuse_Pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8615 | 90480 | SYRUP,CANE | 26.0 | 269 | 0.0 | 0.0 | 0.86 | 73.14 | 0.0 | 73.2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 21.0 | 1 serving | NaN | NaN | 0.0 |
| 8616 | 90560 | SNAIL,RAW | 79.2 | 90 | 16.1 | 1.4 | 1.3 | 2.0 | 0.0 | 0.0 | ... | 0.1 | 0.361 | 0.259 | 0.252 | 50.0 | 85.0 | 3 oz | NaN | NaN | 0.0 |
| 8617 | 93600 | TURTLE,GREEN,RAW | 78.5 | 89 | 19.8 | 0.5 | 1.2 | 0.0 | 0.0 | 0.0 | ... | 0.1 | 0.127 | 0.088 | 0.17 | 50.0 | 85.0 | 3 oz | NaN | NaN | 0.0 |
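
Selecting the last rows by label only works while the index is the default `0..n-1` range. As a sketch on hypothetical toy data, `tail` and negative `iloc` slices give the same rows by position, whatever the index looks like:

```python
import pandas

# Hypothetical toy data with ten rows and the default integer index.
toy = pandas.DataFrame({"value": range(10)})
n = toy.shape[0]

last_by_label = toy.loc[n - 3:n - 1]  # relies on the default 0..n-1 index
last_by_tail = toy.tail(3)            # position-based, works with any index
last_by_iloc = toy.iloc[-3:]          # position-based, negative slice

print(last_by_tail)
```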


Accessing a column by its label returns a Series object:

In [9]:
food_info['Shrt_Desc']

0                                        BUTTER,WITH SALT
1                                BUTTER,WHIPPED,WITH SALT
2                                    BUTTER OIL,ANHYDROUS
3                                             CHEESE,BLUE
4                                            CHEESE,BRICK
5                                             CHEESE,BRIE
6                                        CHEESE,CAMEMBERT
7                                          CHEESE,CARAWAY
8                                          CHEESE,CHEDDAR
9                                         CHEESE,CHESHIRE
10                                           CHEESE,COLBY
11                    CHEESE,COTTAGE,CRMD,LRG OR SML CURD
12                            CHEESE,COTTAGE,CRMD,W/FRUIT
13       CHEESE,COTTAGE,NONFAT,UNCRMD,DRY,LRG OR SML CURD
14                       CHEESE,COTTAGE,LOWFAT,2% MILKFAT
15                       CHEESE,COTTAGE,LOWFAT,1% MILKFAT
16                                           CHEESE,CREAM
17            

In [10]:
saturated_fat = food_info["FA_Sat_(g)"]
cholesterol = food_info["Cholestrl_(mg)"]
type(cholesterol)

pandas.core.series.Series

To select several columns and assign them to a new dataframe, pass a list of column labels:

In [11]:
selenium_thiamin = food_info[['Selenium_(µg)', 'Thiamin_(mg)']]
selenium_thiamin.head(3)

|  | Selenium_(µg) | Thiamin_(mg) |
|---|---|---|
| 0 | 1.0 | 0.005 |
| 1 | 1.0 | 0.005 |
| 2 | 0.0 | 0.001 |
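
The type of the result depends on what is passed inside the brackets. The sketch below, on hypothetical toy data, contrasts a single label (which gives a Series) with a list of labels (which gives a dataframe, even when the list holds one label):

```python
import pandas

# Hypothetical toy data with three columns.
toy = pandas.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one_column = toy["a"]          # single label: a Series
one_column_df = toy[["a"]]     # list with one label: a dataframe with one column
two_columns = toy[["a", "c"]]  # list of labels: a dataframe with two columns

print(type(one_column), type(one_column_df), two_columns.shape)
```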


To list all the column names:

In [12]:
food_info.columns.tolist()

['NDB_No',
 'Shrt_Desc',
 'Water_(g)',
 'Energ_Kcal',
 'Protein_(g)',
 'Lipid_Tot_(g)',
 'Ash_(g)',
 'Carbohydrt_(g)',
 'Fiber_TD_(g)',
 'Sugar_Tot_(g)',
 'Calcium_(mg)',
 'Iron_(mg)',
 'Magnesium_(mg)',
 'Phosphorus_(mg)',
 'Potassium_(mg)',
 'Sodium_(mg)',
 'Zinc_(mg)',
 'Copper_mg)',
 'Manganese_(mg)',
 'Selenium_(µg)',
 'Vit_C_(mg)',
 'Thiamin_(mg)',
 'Riboflavin_(mg)',
 'Niacin_(mg)',
 'Panto_Acid_mg)',
 'Vit_B6_(mg)',
 'Folate_Tot_(µg)',
 'Folic_Acid_(µg)',
 'Food_Folate_(µg)',
 'Folate_DFE_(µg)',
 'Choline_Tot_ (mg)',
 'Vit_B12_(µg)',
 'Vit_A_IU',
 'Vit_A_RAE',
 'Retinol_(µg)',
 'Alpha_Carot_(µg)',
 'Beta_Carot_(µg)',
 'Beta_Crypt_(µg)',
 'Lycopene_(µg)',
 'Lut+Zea_ (µg)',
 'Vit_E_(mg)',
 'Vit_D_µg',
 'Vit_D_IU',
 'Vit_K_(µg)',
 'FA_Sat_(g)',
 'FA_Mono_(g)',
 'FA_Poly_(g)',
 'Cholestrl_(mg)',
 'GmWt_1',
 'GmWt_Desc1',
 'GmWt_2',
 'GmWt_Desc2',
 'Refuse_Pct']

To create a new dataframe containing only the columns whose names end with "(g)":

In [13]:
gram_columns = [col for col in food_info.columns if col.endswith("(g)")]
gram_df = food_info[gram_columns]
gram_df.head(3)

|  | Water_(g) | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.87 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | 51.368 | 21.021 | 3.043 |
| 1 | 15.87 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | 50.489 | 23.426 | 3.012 |
| 2 | 0.24 | 0.28 | 99.48 | 0.0 | 0.0 | 0.0 | 0.0 | 61.924 | 28.732 | 3.694 |
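
The same selection can also be written without a list comprehension. The sketch below shows two possible alternatives on a hypothetical one-row dataframe that reuses three of the column names: a boolean mask built with the string methods of the column index, and `DataFrame.filter` with a regular expression.

```python
import pandas

# Hypothetical one-row excerpt that reuses three column names from the dataset.
toy = pandas.DataFrame(
    {"Water_(g)": [15.87], "Energ_Kcal": [717], "Protein_(g)": [0.85]}
)

# Boolean mask over the column names, used to select columns with loc.
mask = toy.columns.str.endswith("(g)")
gram_by_mask = toy.loc[:, mask]

# Regular expression matched against the column names; "$" anchors the end.
gram_by_regex = toy.filter(regex=r"\(g\)$")

print(gram_by_mask.columns.tolist())
```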


Arithmetic on a Series is applied element by element, so a column can be normalised by dividing it by its maximum value:

In [14]:
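# Largest value of each column.
max_protein = food_info["Protein_(g)"].max()
max_fat = food_info["Lipid_Tot_(g)"].max()
# Divide every value by the column maximum, so the results are at most 1.
normalized_protein = food_info["Protein_(g)"] / max_protein
normalized_fat = food_info["Lipid_Tot_(g)"] / max_fat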
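
As a small sketch of how this division behaves, the example below normalises a hypothetical Series that contains a missing value. `max()` skips `NaN` by default, while the division leaves the missing entry as `NaN`.

```python
import numpy as np
import pandas

# Hypothetical protein values, one of them missing.
protein = pandas.Series([0.0, 22.0, np.nan, 88.0])

# max() ignores NaN by default; the division is applied to each element.
normalized = protein / protein.max()

print(normalized.tolist())
```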

To add the normalised results to `food_info` as new columns:

In [15]:
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat

To calculate the nutritional index, defined here as 2 × the normalised protein minus 0.75 × the normalised fat:

In [16]:
food_info["Norm_Nutr_Index"] = 2 * food_info["Normalized_Protein"] - 0.75 * food_info["Normalized_Fat"]
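
The formula can be checked by hand on a tiny example. The two rows below use hypothetical protein and fat values, chosen so that the arithmetic is easy to follow; the column names are illustrative only.

```python
import pandas

# Hypothetical values: the second row holds the maximum of both columns.
toy = pandas.DataFrame({"protein": [10.0, 40.0], "fat": [50.0, 100.0]})

norm_protein = toy["protein"] / toy["protein"].max()  # 0.25 and 1.0
norm_fat = toy["fat"] / toy["fat"].max()              # 0.5 and 1.0

# Same weighting as above: protein counts positively, fat negatively.
toy["index"] = 2 * norm_protein - 0.75 * norm_fat

print(toy)
```

For the first row this gives 2 × 0.25 − 0.75 × 0.5 = 0.125, and for the second row 2 × 1.0 − 0.75 × 1.0 = 1.25.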

In [17]:
print(food_info[['Shrt_Desc', "Norm_Nutr_Index"]])

                                              Shrt_Desc  Norm_Nutr_Index
0                                      BUTTER,WITH SALT        -0.589077
1                              BUTTER,WHIPPED,WITH SALT        -0.589077
2                                  BUTTER OIL,ANHYDROUS        -0.739759
3                                           CHEESE,BLUE         0.269051
4                                          CHEESE,BRICK         0.303668
5                                           CHEESE,BRIE         0.262282
6                                      CHEESE,CAMEMBERT         0.266420
7                                        CHEESE,CARAWAY         0.351199
8                                        CHEESE,CHEDDAR         0.290734
9                                       CHEESE,CHESHIRE         0.299712
10                                         CHEESE,COLBY         0.297218
11                  CHEESE,COTTAGE,CRMD,LRG OR SML CURD         0.219562
12                          CHEESE,COTTAGE,CRMD,W/F

To explore which foods rank the highest in the Norm_Nutr_Index column, we need to sort the DataFrame by that column.

In [18]:
food_info.sort_values("Norm_Nutr_Index", inplace=True, ascending=False)

In [19]:
print(food_info[['Shrt_Desc', "Norm_Nutr_Index"]])

                                              Shrt_Desc  Norm_Nutr_Index
4991           SOY PROT ISOLATE,K TYPE,CRUDE PROT BASIS         1.996025
6155                           GELATINS,DRY PDR,UNSWTND         1.937656
216              EGG,WHITE,DRIED,STABILIZED,GLUCOSE RED         1.912840
124          EGG,WHITE,DRIED,PDR,STABILIZED,GLUCOSE RED         1.865642
8152   SEAL,BEARDED (OOGRUK),MEAT,DRIED (ALASKA NATIVE)         1.853221
151                                     EGG,WHITE,DRIED         1.836504
4990                            SOY PROT ISOLATE,K TYPE         1.823244
4833                                SOY PROTEIN ISOLATE         1.801794
4200                        BEVERAGES,PROT PDR WHEY BSD         1.757548
123       EGG,WHITE,DRIED,FLAKES,STABILIZED,GLUCOSE RED         1.741548
8234     STEELHEAD TROUT,DRIED,FLESH (SHOSHONE BANNOCK)         1.689324
8611                                 VITAL WHEAT GLUTEN         1.688118
8117            WHALE,BELUGA,MEAT,DRIED (ALASKA NAT
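
Sorting keeps the original index labels attached to each row, which is why the labels above are no longer in order. The sketch below, on hypothetical toy data, shows a variant that returns a sorted copy instead of sorting in place, plus `nlargest` as a possible shortcut when only the top rows are needed.

```python
import pandas

# Hypothetical toy data reusing two column names from the dataset.
toy = pandas.DataFrame(
    {"Shrt_Desc": ["A", "B", "C"], "Norm_Nutr_Index": [0.3, 1.9, -0.5]}
)

# Without inplace=True, sort_values returns a sorted copy and leaves toy unchanged.
ranked = toy.sort_values("Norm_Nutr_Index", ascending=False)
print(ranked.index.tolist())  # the original row labels travel with the rows

# Renumber the rows 0..n-1 if positions and labels should match again.
ranked_reset = ranked.reset_index(drop=True)

# Only the two highest-ranked rows.
top_two = toy.nlargest(2, "Norm_Nutr_Index")
print(top_two)
```

After sorting, a label-based slice such as `loc[0:2]` no longer means "the first three rows"; `head(3)` or `iloc[0:3]` select by position instead.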