# S04 - Revisiting pandas

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/data/salesCerealsOriginal.csv'
df = pd.read_csv(url , parse_dates=['WEEK_END_DATE'])
df.head()

Unnamed: 0,WEEK_END_DATE,STORE_NUM,UPC,UNITS,VISITS,HHS,SPEND,PRICE,BASE_PRICE,FEATURE,DISPLAY,TPR_ONLY,DESCRIPTION,CATEGORY,SUB_CATEGORY
0,2009-01-14,367,1111085319,14,13,13,26.32,1.88,1.88,0,0,0,PL HONEY NUT TOASTD OATS,COLD CEREAL,ALL FAMILY CEREAL
1,2009-01-14,367,1111085350,35,27,25,69.3,1.98,1.98,0,0,0,PL BT SZ FRSTD SHRD WHT,COLD CEREAL,ALL FAMILY CEREAL
2,2009-01-14,367,1600027527,12,10,10,38.28,3.19,3.19,0,0,0,GM HONEY NUT CHEERIOS,COLD CEREAL,ALL FAMILY CEREAL
3,2009-01-14,367,1600027528,31,26,19,142.29,4.59,4.59,0,0,0,GM CHEERIOS,COLD CEREAL,ALL FAMILY CEREAL
4,2009-01-14,367,1600027564,56,48,42,152.32,2.72,3.07,1,0,0,GM CHEERIOS,COLD CEREAL,ALL FAMILY CEREAL


## Groupby (split-apply-combine)

`groupby` in pandas has a lot in common with pivot tables in Excel. We will now look at the operations that take place under the hood when `groupby` is used: the data is split into groups, a function is applied to each group, and the results are combined into one object.

[Link for more information](https://realpython.com/pandas-groupby/)
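To make the pivot-table analogy concrete, here is a minimal sketch comparing `groupby` with `pivot_table`. The `toy` DataFrame and its values are hypothetical and are not taken from the cereal dataset.

```python
import pandas as pd

# Hypothetical toy data (not the cereal dataset), just to compare both APIs.
toy = pd.DataFrame({
    "UPC": [111, 111, 222, 222],
    "PRICE": [1.0, 3.0, 2.0, 4.0],
})

# groupby: one mean price per UPC, returned as a Series indexed by UPC.
by_groupby = toy.groupby("UPC")["PRICE"].mean()

# pivot_table: the same aggregation, returned as a one-column DataFrame.
by_pivot = toy.pivot_table(index="UPC", values="PRICE", aggfunc="mean")

print(by_groupby)
print(by_pivot)
```

Both calls should report the same mean per `UPC`; only the container differs (a `Series` versus a `DataFrame`).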

In [2]:
# in the book:
df.groupby('UPC').PRICE.mean()

UPC
1111085319    1.788910
1111085350    2.183333
1600027527    2.893077
1600027528    4.498590
1600027564    2.893355
3000006340    2.878195
3800031829    3.256774
Name: PRICE, dtype: float64

In [3]:
# split
by_upc = df.groupby('UPC')
print(by_upc)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002709F2E1D10>


In [4]:
for upc, frame in by_upc:
    print(upc)
    print(frame.head())

1111085319
   WEEK_END_DATE  STORE_NUM         UPC  UNITS  VISITS  HHS  SPEND  PRICE  \
0     2009-01-14        367  1111085319     14      13   13  26.32   1.88   
7     2009-01-21        367  1111085319     12      12   12  22.68   1.89   
14    2009-01-28        367  1111085319     18      17   16  33.66   1.87   
21    2009-02-04        367  1111085319     13      13   13  24.44   1.88   
28    2009-02-11        367  1111085319     16      16   16  29.92   1.87   

    BASE_PRICE  FEATURE  DISPLAY  TPR_ONLY               DESCRIPTION  \
0         1.88        0        0         0  PL HONEY NUT TOASTD OATS   
7         1.89        0        0         0  PL HONEY NUT TOASTD OATS   
14        1.87        0        0         0  PL HONEY NUT TOASTD OATS   
21        1.88        0        0         0  PL HONEY NUT TOASTD OATS   
28        1.87        0        0         0  PL HONEY NUT TOASTD OATS   

       CATEGORY       SUB_CATEGORY  
0   COLD CEREAL  ALL FAMILY CEREAL  
7   COLD CEREAL  AL

In [5]:
df.loc[df.UPC == 1111085319].head()

Unnamed: 0,WEEK_END_DATE,STORE_NUM,UPC,UNITS,VISITS,HHS,SPEND,PRICE,BASE_PRICE,FEATURE,DISPLAY,TPR_ONLY,DESCRIPTION,CATEGORY,SUB_CATEGORY
0,2009-01-14,367,1111085319,14,13,13,26.32,1.88,1.88,0,0,0,PL HONEY NUT TOASTD OATS,COLD CEREAL,ALL FAMILY CEREAL
7,2009-01-21,367,1111085319,12,12,12,22.68,1.89,1.89,0,0,0,PL HONEY NUT TOASTD OATS,COLD CEREAL,ALL FAMILY CEREAL
14,2009-01-28,367,1111085319,18,17,16,33.66,1.87,1.87,0,0,0,PL HONEY NUT TOASTD OATS,COLD CEREAL,ALL FAMILY CEREAL
21,2009-02-04,367,1111085319,13,13,13,24.44,1.88,1.88,0,0,0,PL HONEY NUT TOASTD OATS,COLD CEREAL,ALL FAMILY CEREAL
28,2009-02-11,367,1111085319,16,16,16,29.92,1.87,1.87,0,0,0,PL HONEY NUT TOASTD OATS,COLD CEREAL,ALL FAMILY CEREAL


In [6]:
# split + apply
# combine means gathering all the results back into a single `DataFrame`
for upc, frame in by_upc:
    print(upc, frame.PRICE.mean())

1111085319 1.7889102564102564
1111085350 2.1833333333333336
1600027527 2.8930769230769235
1600027528 4.4985897435897435
1600027564 2.8933548387096777
3000006340 2.8781954887218046
3800031829 3.256774193548387
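The loop above only prints the per-group means. As a rough sketch of the missing "combine" step, the results can be collected into a single `Series` and compared with the one-line version. The `toy` data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the cereal sales table.
toy = pd.DataFrame({
    "UPC": [111, 111, 222, 222, 222],
    "PRICE": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# split + apply: compute one mean per group by hand.
means = {}
for upc, frame in toy.groupby("UPC"):
    means[upc] = frame["PRICE"].mean()

# combine: gather the per-group results into a single Series.
combined = pd.Series(means, name="PRICE")
combined.index.name = "UPC"

# The one-liner carries out the same three steps.
direct = toy.groupby("UPC")["PRICE"].mean()

print(combined)
print(direct)
```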


For several grouping keys, several columns, and several operations, we can write:

In [7]:
# the result has a multi-index on both the rows and the columns
temp = df.groupby(['STORE_NUM', 'UPC'])[['PRICE', 'SPEND']].agg(
    {'PRICE': ['mean', 'min'],
     'SPEND': ['max', 'median']})
print(temp)

                         PRICE         SPEND         
                          mean   min     max   median
STORE_NUM UPC                                        
367       1111085319  1.788910  1.60   68.06   28.050
          1111085350  2.183333  1.82  104.41   39.440
          1600027527  2.893077  1.66  403.38   49.235
          1600027528  4.498590  2.60  414.40  124.235
          1600027564  2.893355  1.51  198.40   65.780
          3000006340  2.878195  1.88   87.75   14.300
          3800031829  3.256774  2.18  176.22   76.020


In [8]:
print(temp.columns)
# flatten the column multi-index into single-level names
temp.columns = ['_'.join(col) for col in temp.columns]
print(temp.columns)

MultiIndex([('PRICE',   'mean'),
            ('PRICE',    'min'),
            ('SPEND',    'max'),
            ('SPEND', 'median')],
           )
Index(['PRICE_mean', 'PRICE_min', 'SPEND_max', 'SPEND_median'], dtype='object')


In [9]:
# turn the row multi-index back into regular columns
temp = temp.reset_index()
print(temp)

   STORE_NUM         UPC  PRICE_mean  PRICE_min  SPEND_max  SPEND_median
0        367  1111085319    1.788910       1.60      68.06        28.050
1        367  1111085350    2.183333       1.82     104.41        39.440
2        367  1600027527    2.893077       1.66     403.38        49.235
3        367  1600027528    4.498590       2.60     414.40       124.235
4        367  1600027564    2.893355       1.51     198.40        65.780
5        367  3000006340    2.878195       1.88      87.75        14.300
6        367  3800031829    3.256774       2.18     176.22        76.020
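As an alternative to flattening the multi-index afterwards, pandas also supports named aggregation, which produces flat column names directly. The sketch below uses hypothetical toy data that only borrows the column names of the cereal table.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the cereal table.
toy = pd.DataFrame({
    "STORE_NUM": [367, 367, 367, 367],
    "UPC": [111, 111, 222, 222],
    "PRICE": [1.0, 3.0, 2.0, 4.0],
    "SPEND": [10.0, 30.0, 20.0, 50.0],
})

# Named aggregation: each keyword becomes an output column name,
# and as_index=False keeps the grouping keys as regular columns.
flat = toy.groupby(["STORE_NUM", "UPC"], as_index=False).agg(
    PRICE_mean=("PRICE", "mean"),
    PRICE_min=("PRICE", "min"),
    SPEND_max=("SPEND", "max"),
    SPEND_median=("SPEND", "median"),
)
print(flat)
```

The dictionary form shown earlier is shorter when many functions apply to the same column; the named form avoids the two clean-up steps.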


## Join et concat

There are several ways to join two datasets. For this first part we focus on the `join` method, which matches rows on the index by default. The `merge` function does much the same thing but is more flexible, for example by matching on arbitrary columns.

Some of the operations below amount to a `VLOOKUP` (`RECHERCHEV` in French Excel).

| Merge/join method | SQL Join Name  | Description  | 
| ------- | --------| ---------|
| ``left`` | ``LEFT OUTER JOIN`` |  Use keys from left frame only | 
| ``right`` | ``RIGHT OUTER JOIN`` |  Use keys from right frame only | 
| ``outer`` | ``FULL OUTER JOIN`` |  Use union of keys from both frames | 
| ``inner`` | ``INNER JOIN`` | Use intersection of keys from both frames | 
| ``cross`` | ``CROSS JOIN`` |  Create the cartesian product of rows of both frames | 


![merge method](https://raw.githubusercontent.com/acedesci/scanalytics/master/FR/S04_Data_Structures_2/_static/merge_method.jpg)
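The `VLOOKUP` analogy can be sketched as follows: a left join pulls a description from a lookup table into each sales line. Both DataFrames and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical sales lines and a hypothetical product lookup table.
sales = pd.DataFrame({
    "UPC": [111, 222, 111, 333],
    "UNITS": [14, 35, 12, 31],
})
products = pd.DataFrame(
    {"DESCRIPTION": ["HONEY NUT OATS", "FROSTED WHEAT"]},
    index=pd.Index([111, 222], name="UPC"),
)

# on="UPC" matches the UPC column of `sales` against the index of `products`.
# With the default how="left", every sales line is kept; UPC 333 has no match,
# so its DESCRIPTION is missing (NaN), much like #N/A in Excel.
looked_up = sales.join(products, on="UPC")
print(looked_up)
```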

In [10]:
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], 
     "B": ["B0", "B1", "B2"]}, 
     index=["K0", "K1", "K2"]
)
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [11]:
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], 
     "D": ["D0", "D2", "D3"]}, 
     index=["K0", "K2", "K3"]
)
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [12]:
left.join(right)  # default is `how='left'`

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [13]:
left.join(right, how='right')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2
K3,,,C3,D3


In [14]:
left.join(right, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3


In [15]:
left.join(right, how='inner')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2


In [16]:
left.join(right, how='cross')

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A0,B0,C2,D2
2,A0,B0,C3,D3
3,A1,B1,C0,D0
4,A1,B1,C2,D2
5,A1,B1,C3,D3
6,A2,B2,C0,D0
7,A2,B2,C2,D2
8,A2,B2,C3,D3
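For comparison, here is a small sketch of the same inner join written with `merge`. It reuses the `left` and `right` frames defined above; the only difference is that `merge` has to be told explicitly to match on the index.

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
    index=["K0", "K1", "K2"],
)
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]},
    index=["K0", "K2", "K3"],
)

# `join` aligns on the index by default ...
joined = left.join(right, how="inner")

# ... while `merge` must be told to use the index on both sides.
merged = left.merge(right, left_index=True, right_index=True, how="inner")

print(joined)
print(merged)
```

When the key lives in a regular column rather than in the index, `merge` with `on=` is usually the more direct choice.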


The `concat` function, for its part, is generally used to combine two datasets by appending the rows of one to the other.

In [21]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    })
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [22]:
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    })
df2

Unnamed: 0,A,B,C,D
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


In [27]:
pd.concat([df1, df2])  # default: `axis=0` ('index'), i.e. the rows are stacked

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7
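Note that the stacked result keeps the original row labels, so 0 to 3 appear twice. A minimal sketch of how `ignore_index=True` renumbers the rows, using two small hypothetical frames named `top` and `bottom`:

```python
import pandas as pd

# Two small hypothetical frames, each with the default index 0, 1.
top = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
bottom = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# Default behaviour: the original labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and renumbers the rows 0..n-1.
renumbered = pd.concat([top, bottom], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```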


We can also simply append the columns on the right. *Here the rows are matched on the default integer index, so make sure the two `DataFrame` objects hold the observations in the same order, with no missing rows.* It is generally preferable to use the `join` method, seen earlier, instead.

In [28]:
pd.concat([df1, df2], axis='columns')

Unnamed: 0,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,A4,B4,C4,D4
1,A1,B1,C1,D1,A5,B5,C5,D5
2,A2,B2,C2,D2,A6,B6,C6,D6
3,A3,B3,C3,D3,A7,B7,C7,D7
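To illustrate why the alignment matters, here is a short sketch with two hypothetical frames whose index labels only partly overlap. Column-wise `concat` matches rows by label, not by position.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap.
prices = pd.DataFrame({"PRICE": [1.88, 1.98, 3.19]}, index=["K0", "K1", "K2"])
units = pd.DataFrame({"UNITS": [14, 12]}, index=["K2", "K0"])

# concat matches rows by index label, not by position; labels missing
# from one frame produce NaN (the default is join="outer").
side_by_side = pd.concat([prices, units], axis="columns")
print(side_by_side)
```

The row `K1` has no counterpart in `units`, so its `UNITS` value is missing, and `K2` receives 14 even though that value comes first in `units`.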
