# Exercises

## Set Up

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [3]:
# silence all warnings (including pandas') to keep the notebook output readable
import warnings
warnings.simplefilter(action='ignore')
%config InlineBackend.figure_format='svg'

In [4]:
FONT_FAMILY = 'DejaVu Sans'
FONT_SCALE = 1.3

data_dir = '../data'

## Load and preprocess the dataset

The exercises are based on a dataset called **cereals**. Information about the authors and contributors of this dataset can be found [on Kaggle](https://www.kaggle.com/code/hiralmshah/nutrition-data-analysis-from-80-cereals).
The dataset has 77 records and 16 columns. The columns are:
- **name** - name of the cereal
- **mfr** - manufacturer of the cereal, stored as a single letter. The `manufacturers_df` dataframe loaded below maps each letter to the company name.
- **type** - hot or cold, the preferred way of eating
- **calories** - amount of calories per portion
- **protein** - grams of protein
- **fat** - grams of fat
- **sodium** - milligrams of sodium
- **fiber** - amount in grams per portion
- **carbo** - amount of carbohydrates in grams
- **sugars** - amount in grams per portion
- **potass** - amount in milligrams per portion
- **vitamins** - vitamins and minerals (0, 25 or 100) in percentage
- **shelf** - shelf on which the cereal is displayed in the supermarket (1, 2 or 3, counted from the floor)
- **weight** - weight in ounces for one portion
- **cups** - number of cups per portion
- **rating** - rating of the cereal

In [5]:
# loading main dataset
cereals = pd.read_csv(f'{data_dir}/cereal.csv', sep=',')
cereals.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
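
In the output above, Almond Delight has a `potass` value of -1. A negative amount cannot be a real measurement, so it presumably marks a missing value; that reading is an assumption, not something the column descriptions state. A minimal sketch of how such a sentinel could be turned into `NaN`, using a toy frame copied from the five rows shown (the names `sample` and `cleaned` are illustrative):

```python
import pandas as pd

# Toy frame copied from the five rows shown in the head() output.
sample = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'potass': [280, 135, 320, 330, -1],
})

# Assumption: a negative amount encodes "missing", so replace it with NaN.
cleaned = sample.assign(potass=sample['potass'].mask(sample['potass'] < 0))

# Number of values that were treated as missing.
n_missing = cleaned['potass'].isna().sum()
```

Whether to apply this before plotting is a judgment call: leaving the -1 in place keeps the row but puts a point below zero on any potassium axis.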


In [6]:
# loading the dataset that maps manufacturers' letters to their names
manufacturers_df = pd.read_csv(f'{data_dir}/manufacturers.csv')
manufacturers_df

Unnamed: 0,letter,company_name
0,A,American Home Food Products
1,G,General Mills
2,K,Kelloggs
3,N,Nabisco
4,P,Post
5,Q,Quaker Oats
6,R,Ralston Purina


In [7]:
# join on the manufacturer letter so every cereal row gains a company_name column
cereals_with_mfr_names = pd.merge(cereals, manufacturers_df,
                                  left_on='mfr',
                                  right_on='letter')
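
`pd.merge` performs an inner join by default, so a cereal whose letter has no match in the lookup table would silently disappear. The sketch below shows one way to check for that with a left join and `indicator=True`. The frames are toy stand-ins, and the "Mystery Flakes" row with letter `X` is hypothetical, added only to trigger a mismatch.

```python
import pandas as pd

# Toy stand-ins for `cereals` and `manufacturers_df`.
cereals_sample = pd.DataFrame({
    'name': ['100% Bran', 'All-Bran', 'Almond Delight', 'Mystery Flakes'],
    'mfr': ['N', 'K', 'R', 'X'],  # 'X' is a hypothetical, unmapped letter
})
manufacturers_sample = pd.DataFrame({
    'letter': ['K', 'N', 'R'],
    'company_name': ['Kelloggs', 'Nabisco', 'Ralston Purina'],
})

# A left join keeps every cereal; `_merge` records where each row was found.
merged = cereals_sample.merge(manufacturers_sample,
                              left_on='mfr', right_on='letter',
                              how='left', indicator=True)

# Rows that exist only on the left side have no manufacturer name.
unmatched = merged[merged['_merge'] == 'left_only']
```

If `unmatched` is empty for the real data, the inner join used in the notebook loses nothing.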

## Exercises

### Exercise 1

Plot the number of products per manufacturer, displaying the manufacturer's name instead of the letter that appears in the `cereals` dataframe. All the data you need is in the `cereals_with_mfr_names` dataframe. Your task is to visualize it.

In [8]:
# write your code here
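
As a hint, one possible approach is to count rows per company and draw the counts as horizontal bars. This is a sketch on a toy frame, not the reference solution; `toy` and `counts` are illustrative names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for cereals_with_mfr_names.
toy = pd.DataFrame({
    'name': ['100% Bran', 'All-Bran', 'All-Bran with Extra Fiber', 'Almond Delight'],
    'company_name': ['Nabisco', 'Kelloggs', 'Kelloggs', 'Ralston Purina'],
})

# Count products per company; ascending order puts the longest bar at the top.
counts = toy['company_name'].value_counts().sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
counts.plot.barh(ax=ax)
ax.set_xlabel('Number of products')
ax.set_ylabel('Manufacturer')
fig.tight_layout()
```

Horizontal bars leave room for long company names, which vertical bars would force you to rotate.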

### Exercise 2

Plot the distribution of ratings per company and check at the same time whether there are any outliers. You can find the necessary data in the `data` dataframe.

In [9]:
data = cereals_with_mfr_names[['company_name', 'rating']]

In [10]:
# write your code here
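
As a hint, a box plot shows a distribution and its outliers in one figure. The sketch below uses hypothetical ratings invented for illustration and also recomputes the outliers by hand with the 1.5 × IQR rule, which is the default whisker rule for box plots in seaborn and matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical ratings, invented for illustration only.
toy = pd.DataFrame({
    'company_name': ['Kelloggs'] * 6 + ['Nabisco'] * 5,
    'rating': [30.0, 32.0, 34.0, 36.0, 38.0, 90.0,
               55.0, 60.0, 65.0, 68.0, 72.0],
})

# One box per company; points beyond the whiskers are drawn individually.
fig, ax = plt.subplots(figsize=(6, 3))
sns.boxplot(data=toy, x='rating', y='company_name', ax=ax)
ax.set_xlabel('Rating')
ax.set_ylabel('Manufacturer')

# Recompute the outliers per company with the 1.5 * IQR rule.
grouped = toy.groupby('company_name')['rating']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
spread = q3 - q1
is_outlier = (toy['rating'] < q1 - 1.5 * spread) | (toy['rating'] > q3 + 1.5 * spread)
outliers = toy[is_outlier]
```

The hand computation is optional; it is useful when you need the outlying rows themselves rather than just the picture.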

### Exercise 3

Find and visualize the ratings per product. You will find the necessary data in the `data` dataframe.

In [11]:
data = cereals[['name', 'rating']].groupby('name').mean().reset_index()

In [12]:
# write your code here
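
As a hint, a sorted horizontal bar chart keeps many product names legible. The sketch below uses the five ratings visible in the `head()` output of the loading cell; `toy` and `ordered` are illustrative names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The five products and ratings shown in the head() output.
toy = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843],
})

# Ascending sort puts the best-rated product at the top of a barh chart.
ordered = toy.sort_values('rating')

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(ordered['name'], ordered['rating'])
ax.set_xlabel('Rating')
fig.tight_layout()
```

With all 77 cereals the figure needs to be considerably taller than in this sketch so the labels do not overlap.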

### Exercise 4

Find out whether any of the numerical features in the dataset are correlated. The data you need is in the `data_1` and `data_2` dataframes; the features are split across two dataframes so that each output stays readable. Your task is to pick the correct visualization method and supply the data to it.

In [13]:
data_1 = cereals[['calories', 'protein', 'fat', 'sodium', 'rating']]
data_2 = cereals[['fiber', 'carbo', 'sugars', 'potass', 'rating']]

In [14]:
#write your code here for data_1


In [15]:
#write your code here for data_2
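
As a hint, one common way to inspect correlations is to compute a correlation matrix and draw it as an annotated heatmap. The sketch below does this for a toy frame built from the five rows shown in the `head()` output; with so few rows the coefficients mean little, so treat it only as an illustration of the calls.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Same columns as data_1, values copied from the head() output.
toy = pd.DataFrame({
    'calories': [70, 120, 70, 50, 110],
    'protein': [4, 3, 4, 4, 2],
    'fat': [1, 5, 1, 0, 2],
    'sodium': [130, 15, 260, 140, 200],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912, 34.384843],
})

# Pairwise Pearson correlation coefficients (the pandas default).
corr = toy.corr()

# Fix the color range to [-1, 1] so that zero correlation sits mid-scale.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

A diverging colormap such as `coolwarm` makes positive and negative correlations easy to tell apart.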


### Exercise 5

Your next task is to find and visualize the correlations among the features selected below. As always, the data is ready for you in the `data` dataframe; you only have to find the correct visualization method and supply the correct arguments to the function.

In [16]:
data = cereals[['fiber', 'potass', 'sugars', 'calories','rating']]

In [17]:
#write your code here
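
As a hint, a grid of pairwise scatter plots shows every relationship at once. The sketch below uses `pandas.plotting.scatter_matrix` on the first four rows of the `head()` output (Almond Delight is left out because its `potass` value is -1); seaborn's `pairplot` draws a comparable grid.

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Same columns as `data`, values copied from the first four head() rows.
toy = pd.DataFrame({
    'fiber': [10.0, 2.0, 9.0, 14.0],
    'potass': [280, 135, 320, 330],
    'sugars': [6, 8, 5, 0],
    'calories': [70, 120, 70, 50],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912],
})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(toy, figsize=(7, 7), diagonal='hist')
```

`axes` is a 2-D array of matplotlib axes, so individual panels can still be adjusted afterwards.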

### Exercise 6

Using a scatter plot with a color scale, plot the potassium amount against the fiber amount and use the color scale for the rating. The data to be used is ready for you in the `data` dataframe.

In [18]:
data = cereals[['potass', 'fiber', 'rating']]

In [19]:
# write your code here
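
As a hint, matplotlib's `scatter` accepts a numeric `c` argument that is mapped onto a colormap, and `fig.colorbar` adds the matching scale. A minimal sketch on the first four rows of the `head()` output, assuming the rating is the variable shown as color:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the first four rows of the head() output.
toy = pd.DataFrame({
    'potass': [280, 135, 320, 330],
    'fiber': [10.0, 2.0, 9.0, 14.0],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912],
})

fig, ax = plt.subplots(figsize=(5, 4))

# x = fiber, y = potassium, color = rating (assumed mapping).
points = ax.scatter(toy['fiber'], toy['potass'], c=toy['rating'], cmap='viridis')
ax.set_xlabel('Fiber (g)')
ax.set_ylabel('Potassium (mg)')

# The colorbar is the legend for the third variable.
cbar = fig.colorbar(points, ax=ax)
cbar.set_label('Rating')
```

Which variable goes on which axis is a free choice here; swapping `fiber` and `potass` gives an equally valid plot.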

### Exercise 7

Using a scatter plot with a color scale, plot the potassium amount against the fiber amount, the sugar amount and the rating. The data to be used is ready for you in the `data` dataframe. Now we have a fourth variable to plot as well.
You might find some useful information in [the matplotlib scatter-with-legend example](https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html#sphx-glr-gallery-lines-bars-and-markers-scatter-with-legend-py) and in [the `matplotlib.pyplot.scatter` reference](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter).

In [20]:
data = cereals[['potass', 'fiber', 'sugars', 'rating']]

In [21]:
#write your code here
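
As a hint, the fourth variable can be encoded as marker size. The sketch below assumes rating is shown as color and sugars as size; the size formula `20 + 30 * sugars` is an arbitrary choice for illustration. `legend_elements(prop='sizes')` builds the size legend, as in the matplotlib example linked in the exercise text.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the first four rows of the head() output.
toy = pd.DataFrame({
    'potass': [280, 135, 320, 330],
    'fiber': [10.0, 2.0, 9.0, 14.0],
    'sugars': [6, 8, 5, 0],
    'rating': [68.402973, 33.983679, 59.425505, 93.704912],
})

fig, ax = plt.subplots(figsize=(6, 4))

# Marker area grows with sugars; the offset keeps zero-sugar points visible.
sizes = 20 + 30 * toy['sugars']
points = ax.scatter(toy['fiber'], toy['potass'], c=toy['rating'],
                    s=sizes, cmap='viridis', alpha=0.8)
ax.set_xlabel('Fiber (g)')
ax.set_ylabel('Potassium (mg)')
fig.colorbar(points, ax=ax).set_label('Rating')

# Size legend: `func` inverts the size formula so labels show grams of sugar.
handles, labels = points.legend_elements(prop='sizes',
                                         func=lambda s: (s - 20) / 30)
ax.legend(handles, labels, title='Sugars (g)', loc='lower right')
```

Color and size are harder to read precisely than position, so it usually helps to put the two variables you care about most on the axes.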