# Day 2: Get Ready for Pandas


## Non-unique Elements

You are given a non-empty list of integers (X). Return a list containing only the non-unique elements of that list, that is, remove every element that occurs in the list exactly once. Do not change the order of the remaining elements. Example: in [1, 2, 3, 1, 3], the elements 1 and 3 are non-unique, so the result is [1, 3, 1, 3].

### Input
A list of integers.

### Output
A list of integers containing only the non-unique elements.

### Example

```python
checkio([1, 2, 3, 1, 3]) == [1, 3, 1, 3]
checkio([1, 2, 3, 4, 5]) == []
checkio([5, 5, 5, 5, 5]) == [5, 5, 5, 5, 5]
checkio([10, 9, 10, 10, 9, 8]) == [10, 9, 10, 10, 9]
```

### Precondition

```python
0 < len(data) < 1000
```

### Solution

In [181]:
def non_unique(data: list):
    result = []
    
    for number in data:
        if data.count(number) > 1:
            result.append(number)
            
    return result

In [182]:
non_unique([1, 2, 3, 1, 3]) == [1, 3, 1, 3]

True

In [183]:
non_unique([1, 2, 3, 4, 5]) == []

True

In [184]:
non_unique([5, 5, 5, 5, 5]) == [5, 5, 5, 5, 5]

True

In [185]:
non_unique([10, 9, 10, 10, 9, 8]) == [10, 9, 10, 10, 9]

True
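`data.count(number)` rescans the whole list for every element, so the loop above does quadratic work in the length of the list. That is fine for the stated precondition of fewer than 1000 elements. For longer lists, a minimal sketch of a single-pass count with `collections.Counter` could look like this (`non_unique_fast` is a hypothetical name, not part of the original task):

```python
from collections import Counter


def non_unique_fast(data: list) -> list:
    """Keep only the elements that occur more than once, preserving order."""
    counts = Counter(data)  # one pass to count every element
    return [number for number in data if counts[number] > 1]


result = non_unique_fast([10, 9, 10, 10, 9, 8])
print(result)
```

The order of the input is preserved because the list comprehension walks `data` itself and only consults the counts.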

## Getting Started with Pandas

In [186]:
import pandas as pd
pd.__version__

'1.4.2'

## Loading the Dataset

### Downloading CSV with `Requests`

In [187]:
import requests

url = 'https://bit.ly/396WyAZ'

response = requests.get(url)

In [188]:
response.status_code
response.headers['Content-Type']
response.text

"name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating\r\n100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973\r\n100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679\r\nAll-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505\r\nAll-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912\r\nAlmond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843\r\nApple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1,0.75,29.509541\r\nApple Jacks,K,C,110,2,0,125,1,11,14,30,25,2,1,1,33.174094\r\nBasic 4,G,C,130,3,2,210,2,18,8,100,25,3,1.33,0.75,37.038562\r\nBran Chex,R,C,90,2,1,200,4,15,6,125,25,1,1,0.67,49.120253\r\nBran Flakes,P,C,90,3,0,210,5,13,5,190,25,3,1,0.67,53.313813\r\nCap'n'Crunch,Q,C,120,1,2,220,0,12,12,35,25,2,1,0.75,18.042851\r\nCheerios,G,C,110,6,2,290,2,17,1,105,25,1,1,1.25,50.764999\r\nCinnamon Toast Crunch,G,C,120,1,3,210,0,13,9,45,25,2,1,0.75,19.823573\r\nClusters,G,C,110,3,2,140,2,13,7,105,25,3,

In [189]:
import csv

lines = response.text.split('\r\n')
reader = csv.reader(lines)
# list(reader)
lines

['name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating',
 '100% Bran,N,C,70,4,1,130,10,5,6,280,25,3,1,0.33,68.402973',
 '100% Natural Bran,Q,C,120,3,5,15,2,8,8,135,0,3,1,1,33.983679',
 'All-Bran,K,C,70,4,1,260,9,7,5,320,25,3,1,0.33,59.425505',
 'All-Bran with Extra Fiber,K,C,50,4,0,140,14,8,0,330,25,3,1,0.5,93.704912',
 'Almond Delight,R,C,110,2,2,200,1,14,8,-1,25,3,1,0.75,34.384843',
 'Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1,0.75,29.509541',
 'Apple Jacks,K,C,110,2,0,125,1,11,14,30,25,2,1,1,33.174094',
 'Basic 4,G,C,130,3,2,210,2,18,8,100,25,3,1.33,0.75,37.038562',
 'Bran Chex,R,C,90,2,1,200,4,15,6,125,25,1,1,0.67,49.120253',
 'Bran Flakes,P,C,90,3,0,210,5,13,5,190,25,3,1,0.67,53.313813',
 "Cap'n'Crunch,Q,C,120,1,2,220,0,12,12,35,25,2,1,0.75,18.042851",
 'Cheerios,G,C,110,6,2,290,2,17,1,105,25,1,1,1.25,50.764999',
 'Cinnamon Toast Crunch,G,C,120,1,3,210,0,13,9,45,25,2,1,0.75,19.823573',
 'Clusters,G,C,110,3,2,140,2
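Splitting the text on `'\r\n'` by hand works, but the `csv` module can also read straight from an in-memory text stream. The following is a minimal sketch that uses a short hand-written string as a stand-in for `response.text` (the variable `text` and the reduced set of columns are assumptions for illustration):

```python
import csv
import io

# A tiny stand-in for response.text (two rows taken from the dataset above).
text = (
    "name,mfr,type,calories\r\n"
    "100% Bran,N,C,70\r\n"
    "All-Bran,K,C,70\r\n"
)

# newline='' leaves the \r\n line endings for the csv module to handle.
reader = csv.DictReader(io.StringIO(text, newline=''))
rows = list(reader)  # one dict per data row, keyed by the header names

print(rows[0]['name'], rows[0]['calories'])
```

`csv.DictReader` returns every value as a string, so numeric columns still need an explicit conversion such as `int(rows[0]['calories'])`.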

### Downloading Dataset with Pandas

In [190]:
# download csv file from url
url = 'https://bit.ly/396WyAZ'
df = pd.read_csv(url)
type(df)

pandas.core.frame.DataFrame

### Saving the Dataframe to File

In [191]:
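from pathlib import Path

path = Path('data/cereals.csv')
# create the target folder first; to_csv() does not create missing directories
path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(path)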

### Reading Dataset from File

In [192]:
# read the dataset and use column 0 (the index written by to_csv) as the index
df = pd.read_csv(path, index_col=0)

## Dataframe Inspection

In [193]:
# index df
df.index

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
            51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
            68, 69, 70, 71, 72, 73, 74, 75, 76],
           dtype='int64')

In [194]:
# number of rows (count() only counts non-null values, which here is all 77)
len(df)
df.shape[0]
df['name'].count()

77

In [195]:
# number of columns
df.shape[1]
len(df.columns)

16

In [196]:
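# timing both approaches; note that the statement also imports pandas and
# re-reads the CSV on every run, so the result is dominated by file loading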
import timeit

statement = '''
import pandas as pd

df = pd.read_csv('data/cereals.csv', index_col=0)
df.shape[0]
#len(df)
'''

timeit.timeit(statement, number=5000)

27.792767800972797
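To compare `df.shape[0]` and `len(df)` on their own, the import and the data loading can be moved into the `setup` argument of `timeit.timeit`, which runs once and is not included in the measurement. This is a minimal sketch; the small DataFrame built in `setup` is an illustrative stand-in for the cereals data, and the names `time_shape` and `time_len` are assumptions:

```python
import timeit

# Setup runs once per timeit() call and is excluded from the timing.
setup = '''
import pandas as pd
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'rating': [1.0, 2.0, 3.0]})
'''

time_shape = timeit.timeit('df.shape[0]', setup=setup, number=5000)
time_len = timeit.timeit('len(df)', setup=setup, number=5000)

print(f'df.shape[0]: {time_shape:.4f} s, len(df): {time_len:.4f} s')
```

The absolute numbers depend on the machine and the pandas version, so only the relative difference between the two is meaningful.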

In [197]:
# summary info about df
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      77 non-null     object 
 1   mfr       77 non-null     object 
 2   type      77 non-null     object 
 3   calories  77 non-null     int64  
 4   protein   77 non-null     int64  
 5   fat       77 non-null     int64  
 6   sodium    77 non-null     int64  
 7   fiber     77 non-null     float64
 8   carbo     77 non-null     float64
 9   sugars    77 non-null     int64  
 10  potass    77 non-null     int64  
 11  vitamins  77 non-null     int64  
 12  shelf     77 non-null     int64  
 13  weight    77 non-null     float64
 14  cups      77 non-null     float64
 15  rating    77 non-null     float64
dtypes: float64(5), int64(8), object(3)
memory usage: 10.2+ KB


In [198]:
# returns the dtypes of the given df
df.dtypes

name         object
mfr          object
type         object
calories      int64
protein       int64
fat           int64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object

In [199]:
# total number of elements (cells)
df.size
77 * 16

1232

In [200]:
# test whether an element is in a list, i.e. whether a column with the given name exists
df.columns
'name' in df.columns

True

## Series Intro

In [201]:
# a column of the table (df) is of type Series
col = df['name']
type(col)
col

0                     100% Bran
1             100% Natural Bran
2                      All-Bran
3     All-Bran with Extra Fiber
4                Almond Delight
                ...            
72                      Triples
73                         Trix
74                   Wheat Chex
75                     Wheaties
76          Wheaties Honey Gold
Name: name, Length: 77, dtype: object

In [202]:
# read the dataset and turn it into a Series (only one column)
s = pd.read_csv(path, usecols=['name']).squeeze('columns')
type(s)


pandas.core.series.Series

In [203]:
s.index

RangeIndex(start=0, stop=77, step=1)

## Questions

1. Which cereals have the best rating?
2. Which have the least sugar?
3. Which cereals have the most protein?
4. Top 10 best cereals?
5. How many cereals does Kellogg's make?
6. Average rating
7. All cereals located on the top shelf
8. How many can be served cold and how many hot?
9. The biggest calorie bomb?

### `DataFrame.loc[]`

In [204]:
cereals = df

# select rows 4, 5, 10 and show only columns 'name' and 'rating'
cereals.loc[[4, 5, 10], ['name', 'rating']]

| | name | rating |
|---|---|---|
| 4 | Almond Delight | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 29.509541 |
| 10 | Cap'n'Crunch | 18.042851 |
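`.loc[]` selects by index label, not by position. The two only coincide while the index is the default 0, 1, 2, ... sequence. The sketch below uses a three-row frame with non-contiguous labels (`df_demo` is a hypothetical stand-in built from rows of the dataset) to show the same selection with `.loc[]` and with the position-based `.iloc[]`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'name': ['Almond Delight', 'Apple Jacks', 'Cheerios'],
        'rating': [34.384843, 33.174094, 50.764999],
    },
    index=[4, 6, 11],  # labels kept from the original table
)

# .loc uses the index labels 4 and 11 and the column names
by_label = df_demo.loc[[4, 11], ['name', 'rating']]

# .iloc uses integer positions: first and third row, first and second column
by_position = df_demo.iloc[[0, 2], [0, 1]]

print(by_label.equals(by_position))
```

After filtering or sorting, labels and positions no longer match, which is why the later cells keep using `.loc[]` with boolean filters.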


In [205]:
cereals['rating'] # df.rating

0     68.402973
1     33.983679
2     59.425505
3     93.704912
4     34.384843
        ...    
72    39.106174
73    27.753301
74    49.787445
75    51.592193
76    36.187559
Name: rating, Length: 77, dtype: float64

### 5. How many cereals does Kellogg's make?

In [206]:
# create a filter that selects only the cereals made by Kellogg's
filter = cereals['mfr'] == 'K'

# apply the filter and show only the 'name' and 'mfr' columns
df = cereals.loc[filter, ['name', 'mfr']]

# count the rows
df['name'].count()
len(df)
df.shape[0]

23

### 1. Which cereals have the best rating?

In [228]:
max(cereals['rating'])
max_rating = cereals['rating'].max()
# cereals.max()

filter = (cereals['rating'] == max_rating) 
cereals.loc[filter, ['name', 'rating', 'mfr']]


| | name | rating | mfr |
|---|---|---|---|
| 3 | All-Bran with Extra Fiber | 93.704912 | K |


### 2. Which have the least sugar?

In [234]:
filter = (cereals['sugars'] == cereals['sugars'].min())
cereals.loc[filter, ['name','sugars', 'rating']]

| | name | sugars | rating |
|---|---|---|---|
| 57 | Quaker Oatmeal | -1 | 50.828392 |
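A sugar content of -1 is not physically possible, so this value most likely marks a missing measurement (this is an assumption; the notebook itself does not say so). Under that assumption, a minimal sketch that masks the placeholder before looking for the minimum could look like this (`demo` is a small hypothetical frame built from three rows of the dataset):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'name': ['Quaker Oatmeal', 'All-Bran with Extra Fiber', 'Cheerios'],
    'sugars': [-1, 0, 1],
})

# Assumption: -1 is a placeholder for "unknown", so turn it into NaN.
sugars = demo['sugars'].replace(-1, np.nan)

# min() skips NaN by default, and NaN never compares equal to anything.
lowest = demo.loc[sugars == sugars.min(), ['name', 'sugars']]
print(lowest)
```

The same idea would apply to other columns that contain -1, such as `potass` for Almond Delight.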


### 3. Which cereals have the most and which the least protein?

In [246]:
# solution 1
filter_min = cereals['protein'] == cereals['protein'].min()
filter_max = cereals['protein'] == cereals['protein'].max()

cereals.loc[ filter_min | filter_max, ['name', 'protein'] ]

# solution 2
# cereals['protein'] in [min, max]
filter = cereals['protein'].isin([cereals['protein'].min(), cereals['protein'].max()])
cereals.loc[ filter, ['name', 'protein'] ]

| | name | protein |
|---|---|---|
| 10 | Cap'n'Crunch | 1 |
| 11 | Cheerios | 6 |
| 12 | Cinnamon Toast Crunch | 1 |
| 14 | Cocoa Puffs | 1 |
| 17 | Corn Pops | 1 |
| 18 | Count Chocula | 1 |
| 25 | Frosted Flakes | 1 |
| 29 | Fruity Pebbles | 1 |
| 31 | Golden Grahams | 1 |
| 35 | Honey Graham Ohs | 1 |


### X. List the best cereals from Quaker Oats

In [250]:
df = cereals.loc[cereals['mfr'] == 'Q']
df.loc[df['rating'] == df['rating'].max(), ['name', 'mfr', 'rating']]

| | name | mfr | rating |
|---|---|---|---|
| 55 | Puffed Wheat | Q | 63.005645 |


### 4. Top 10 best cereals?

In [280]:
cereals.sort_values('rating', ascending=False).head(10).loc[:, ['name', 'rating']]  # [['name', 'rating']]

| | name | rating |
|---|---|---|
| 3 | All-Bran with Extra Fiber | 93.704912 |
| 64 | Shredded Wheat 'n'Bran | 74.472949 |
| 65 | Shredded Wheat spoon size | 72.801787 |
| 0 | 100% Bran | 68.402973 |
| 63 | Shredded Wheat | 68.235885 |
| 20 | Cream of Wheat (Quick) | 64.533816 |
| 55 | Puffed Wheat | 63.005645 |
| 54 | Puffed Rice | 60.756112 |
| 50 | Nutri-grain Wheat | 59.642837 |
| 2 | All-Bran | 59.425505 |
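Sorting and then taking the head can also be written with `DataFrame.nlargest`, which returns the first `n` rows ordered by the given column in descending order. A minimal sketch on a four-row stand-in frame (`demo` and `top2` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['100% Bran', 'All-Bran', 'All-Bran with Extra Fiber', 'Almond Delight'],
    'rating': [68.402973, 59.425505, 93.704912, 34.384843],
})

# Equivalent to sort_values('rating', ascending=False).head(2)
top2 = demo.nlargest(2, 'rating')[['name', 'rating']]
print(top2)
```

On the full dataset the call would be `cereals.nlargest(10, 'rating')`; the original index labels are kept, just as with `sort_values`.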


### X. Top 10 best cereals from *American Home Food Products* and *General Mills*

In [296]:
# filter = ((cereals['mfr'] == 'A') | (cereals['mfr'] == 'G'))
filter = cereals['mfr'].isin(['A', 'G'])
cereals.loc[filter, ['name', 'mfr', 'rating']].sort_values('rating', ascending=False).head(10)

| | name | mfr | rating |
|---|---|---|---|
| 43 | Maypo | A | 54.850917 |
| 75 | Wheaties | G | 51.592193 |
| 11 | Cheerios | G | 50.764999 |
| 71 | Total Whole Grain | G | 46.658844 |
| 13 | Clusters | G | 40.400208 |
| 47 | Multi-Grain Cheerios | G | 40.105965 |
| 59 | Raisin Nut Bran | G | 39.7034 |
| 40 | Kix | G | 39.241114 |
| 72 | Triples | G | 39.106174 |
| 69 | Total Corn Flakes | G | 38.839746 |


### 5. How many cereals does Kellogg's make?

In [300]:
filter = cereals['mfr'] == 'K'
cereals.loc[filter, 'name'].count()

23
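Because `True` counts as 1 when a boolean Series is summed, the filter itself can be counted without selecting any rows first. A minimal sketch on a four-row stand-in frame (`demo` and `kelloggs_count` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['100% Bran', 'All-Bran', 'Apple Jacks', 'Cheerios'],
    'mfr': ['N', 'K', 'K', 'G'],
})

# Each True in the mask adds 1 to the sum.
kelloggs_count = int((demo['mfr'] == 'K').sum())
print(kelloggs_count)  # 2
```

Unlike `count()` on a selected column, this does not depend on whether the selected column contains missing values.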

### 6. Average rating

In [304]:
cereals['rating'].mean()

42.66570498701299

### 7. All cereals located on the top shelf

In [307]:
max_shelf = cereals['shelf'].max()
filter = cereals['shelf'] == max_shelf
cereals.loc[filter, :]

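| | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| 7 | Basic 4 | G | C | 130 | 3 | 2 | 210 | 2.0 | 18.0 | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.038562 |
| 9 | Bran Flakes | P | C | 90 | 3 | 0 | 210 | 5.0 | 13.0 | 5 | 190 | 25 | 3 | 1.0 | 0.67 | 53.313813 |
| 13 | Clusters | G | C | 110 | 3 | 2 | 140 | 2.0 | 13.0 | 7 | 105 | 25 | 3 | 1.0 | 0.5 | 40.400208 |
| 19 | Cracklin' Oat Bran | K | C | 110 | 3 | 3 | 140 | 4.0 | 10.0 | 7 | 160 | 25 | 3 | 1.0 | 0.5 | 40.448772 |
| 21 | Crispix | K | C | 110 | 2 | 0 | 220 | 1.0 | 21.0 | 3 | 30 | 25 | 3 | 1.0 | 1.0 | 46.895644 |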


### 8. How many can be served cold and how many hot?

In [328]:
cereals.loc[ cereals['type'] == 'C' ]['name'].count(), cereals.loc[ cereals['type'] == 'H' ]['name'].count()

(74, 3)

In [332]:
cereals.groupby('type')['type'].count()

type
C    74
H     3
Name: type, dtype: int64

In [333]:
cereals['type'].value_counts()

C    74
H     3
Name: type, dtype: int64

### 9. The biggest calorie bomb?

In [336]:
cereals.loc[ cereals['calories'] == cereals['calories'].max(), ['name', 'calories'] ]

| | name | calories |
|---|---|---|
| 46 | Mueslix Crispy Blend | 160 |
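When only a single row is needed, `Series.idxmax()` returns the index label of the first maximum, which can then be passed to `.loc[]`. A minimal sketch on a three-row stand-in frame (`demo` and `label` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['100% Bran', 'Basic 4', 'Cheerios'],
    'calories': [70, 130, 110],
})

label = demo['calories'].idxmax()  # index label of the first maximum
print(demo.loc[label, ['name', 'calories']])
```

If several rows share the maximum, `idxmax()` reports only the first one, whereas the equality filter used above returns all of them.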
