Pandas is a library that unifies the most common workflows for which data analysts and data scientists previously relied on many different libraries. Pandas has quickly become an important tool in a data professional's toolbelt and is the most popular library for working with tabular data in Python. Tabular data is any data that can be represented as rows and columns. The CSV files we've worked with in previous missions are all examples of tabular data.

To represent tabular data, pandas uses a custom data structure called a dataframe. A dataframe is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The dataframe is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.

One of the biggest advantages pandas has over NumPy is the ability to store a different data type in each column. Many tabular datasets mix numbers and text; a dataframe handles this naturally, whereas a plain NumPy array holds a single data type. Pandas dataframes also handle missing values gracefully, using NaN ("not a number") to represent them and skipping them by default in aggregations such as mean(). With NumPy alone, people often end up having to find and replace missing values manually. In addition, pandas dataframes carry axis labels for both rows and columns, which lets you refer to elements in the dataframe more intuitively. Since many tabular datasets contain column titles, dataframes preserve that metadata from the file alongside the data.
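
The three points above (mixed types, missing values, labels) can be seen on a tiny dataframe. This is only a sketch: the `foods` data and its column names are made up for illustration and are not taken from food_info.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from food_info.csv.
foods = pd.DataFrame({
    "name": ["butter", "cheese", "milk"],
    "protein_g": [0.85, 21.4, np.nan],  # NaN marks a missing measurement
    "servings": [1, 2, 3],
})

# Each column keeps its own data type.
print(foods.dtypes)

# Missing values can be counted, and mean() skips them by default.
print(foods["protein_g"].isna().sum())
print(foods["protein_g"].mean())

# Row and column labels let you refer to a value by name.
print(foods.loc[1, "name"])
```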

In [None]:
# read_csv()
import pandas
food_info = pandas.read_csv("food_info.csv")
print(type(food_info))

In [None]:
# shape attribute, head() method
print(food_info.head(3)) # head() default is 5 rows
dimensions = food_info.shape
print(dimensions)
num_rows = dimensions[0]
print(num_rows)
num_cols = dimensions[1]
print(num_cols)

first_twenty = food_info.head(20)

In [None]:
# loc[] indexer: select a row by its index label
hundredth_row = food_info.loc[99]
print(hundredth_row)

In [None]:
# DataFrame.dtypes
print(food_info.dtypes)

In [None]:
# selecting multiple rows
print("Rows 3, 4, 5 and 6")
print(food_info.loc[3:6])

print("Rows 2, 5, and 10")
two_five_ten = [2,5,10]
print(food_info.loc[two_five_ten])
# Last five rows: loc slices include both endpoints, so stop at num_rows-1
num_rows = food_info.shape[0]
last_rows = food_info.loc[num_rows-5:num_rows-1]
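
One detail worth checking on a small example is that `loc` slices by label and includes both endpoints, unlike ordinary Python slicing. The sketch below uses a made-up ten-row dataframe `df` as a stand-in for `food_info` and assumes the default integer index.

```python
import pandas as pd

# Hypothetical stand-in for food_info with the default 0..n-1 index.
df = pd.DataFrame({"value": range(10, 20)})

# loc slices by label and includes both endpoints.
middle = df.loc[3:6]
print(middle.index.tolist())

# Plain Python slicing, by contrast, excludes the stop position.
print(list(range(10))[3:6])

# The last five rows carry the labels n-5 through n-1.
n = df.shape[0]
last_five = df.loc[n - 5:n - 1]
print(last_five.index.tolist())
```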

In [None]:
# selecting individual columns
# Selecting a single column by name returns a Series object.
ndb_col = food_info["NDB_No"]
print(ndb_col)

# Display the type of the column to confirm it's a Series object.
print(type(ndb_col))

saturated_fat = food_info["FA_Sat_(g)"]
cholesterol = food_info["Cholestrl_(mg)"]

In [None]:
# selecting multiple columns by name
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]

columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]

selenium_thiamin = food_info[["Selenium_(mcg)","Thiamin_(mg)"]]
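
The difference between single and double brackets is easy to miss, so here is a minimal sketch. The two-row dataframe `df` is invented for illustration; only the column names echo the ones used above.

```python
import pandas as pd

# Hypothetical two-row sample; values are made up for illustration.
df = pd.DataFrame({
    "Zinc_(mg)": [0.09, 0.05],
    "Copper_(mg)": [0.0, 0.016],
    "Water_(g)": [15.87, 15.87],
})

one_col = df["Zinc_(mg)"]        # a string key returns a Series
one_col_df = df[["Zinc_(mg)"]]   # a list key returns a DataFrame, even for one column
two_cols = df[["Zinc_(mg)", "Copper_(mg)"]]

print(type(one_col))
print(type(one_col_df))
print(two_cols.shape)
```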

In [None]:
# columns attribute, tolist() method, string endswith() method
print(food_info.columns)
print(food_info.head(2))
col_names = food_info.columns.tolist()
gram_columns = []

for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head(3))
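
The loop above can also be written as a list comprehension, which some readers find more compact. This is a sketch on a made-up one-row dataframe `df`, not the real file.

```python
import pandas as pd

# Hypothetical one-row sample with a mix of gram and milligram columns.
df = pd.DataFrame({
    "NDB_No": [1001],
    "Water_(g)": [15.87],
    "Protein_(g)": [0.85],
    "Sodium_(mg)": [714],
})

# Keep only the column names that end with "(g)".
gram_columns = [c for c in df.columns if c.endswith("(g)")]
gram_df = df[gram_columns]
print(gram_columns)
```

Note that "Sodium_(mg)" is not matched, because its last three characters are "mg)" rather than "(g)".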

**Data Manipulation With Pandas**

In the previous mission, we learned how to explore a pandas DataFrame. In this mission, we'll explore how to manipulate a DataFrame and make transformations to it. 

In [None]:
# import, read file, get column names as list, print first three rows
import pandas as pd

food_info = pd.read_csv("food_info.csv")
col_names = food_info.columns.tolist()
print(col_names)
food_info.head(3)

In [None]:
# Transforming a column
sodium_grams = food_info["Sodium_(mg)"]/1000
sugar_milligrams = food_info["Sugar_Tot_(g)"]*1000

In [None]:
# Performing math with multiple columns
grams_of_protein_per_gram_of_water = food_info["Protein_(g)"]/food_info["Water_(g)"]
milligrams_of_calcium_and_iron = food_info["Calcium_(mg)"] +  food_info["Iron_(mg)"]
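
Arithmetic on columns is applied element by element, and a missing value in either operand yields a missing result. A minimal sketch, assuming a made-up three-row dataframe `df` with one missing sodium value:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the third sodium value is deliberately missing.
df = pd.DataFrame({
    "Sodium_(mg)": [714.0, 827.0, np.nan],
    "Protein_(g)": [0.85, 0.85, 0.28],
    "Water_(g)": [15.87, 15.87, 0.24],
})

# Scalar arithmetic is applied to every value in the column.
sodium_grams = df["Sodium_(mg)"] / 1000

# Column-by-column arithmetic pairs up values that share a row label.
protein_per_water = df["Protein_(g)"] / df["Water_(g)"]

print(sodium_grams)
print(protein_per_water)
```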

In [None]:
# Create a Nutritional Index
weighted_protein = food_info["Protein_(g)"]*2
weighted_fat = food_info["Lipid_Tot_(g)"]*-0.75
initial_rating = weighted_protein + weighted_fat

In [None]:
# Normalizing columns in a data set
print(food_info["Protein_(g)"][0:5])
max_protein = food_info["Protein_(g)"].max()
max_fat = food_info["Lipid_Tot_(g)"].max()

normalized_protein = food_info["Protein_(g)"]/max_protein
normalized_fat = food_info["Lipid_Tot_(g)"]/max_fat
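
Dividing by the column maximum rescales the values so that the largest becomes 1. Assuming the column has no negative values, the result lies between 0 and 1. The sketch below uses a made-up `protein` Series to check this.

```python
import pandas as pd

# Hypothetical protein values in grams; not taken from food_info.csv.
protein = pd.Series([0.85, 21.4, 3.2, 0.0], name="Protein_(g)")

# Divide every value by the largest one.
normalized = protein / protein.max()

print(normalized)
print(normalized.min(), normalized.max())
```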

In [None]:
# Creating a new column
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat

In [None]:
# Create a normalized nutritional index
food_info["Norm_Nutr_Index"] = 2*food_info["Normalized_Protein"]-0.75*food_info["Normalized_Fat"]

In [1]:
# Sort DataFrame in-place on a column in descending order
food_info.sort_values("Norm_Nutr_Index", inplace=True, ascending=False)
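
Without `inplace=True`, `sort_values()` returns a new sorted dataframe and leaves the original untouched. Either way, the index labels travel with their rows, so they end up out of order after sorting. The sketch below shows this on a made-up three-row dataframe `df`; `reset_index(drop=True)` is one way to renumber the rows afterwards.

```python
import pandas as pd

# Hypothetical sample; the index values are made up for illustration.
df = pd.DataFrame({
    "food": ["a", "b", "c"],
    "Norm_Nutr_Index": [0.2, 1.5, -0.3],
})

# Returns a sorted copy; df itself keeps its original order.
ranked = df.sort_values("Norm_Nutr_Index", ascending=False)
print(ranked.index.tolist())

# Renumber the rows 0..n-1 and discard the old labels.
renumbered = ranked.reset_index(drop=True)
print(renumbered)
```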
