#### Loading our IMDB data

In [None]:
import pandas as pd
import numpy as np
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
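# Load both tables, using the first CSV column as the index
movies = pd.read_csv('movies.csv', index_col=0)
directors = pd.read_csv('directors.csv', index_col=0)
# Left-join so every movie is kept, matching movies.director_id to directors.id
data = movies.merge(directors, how='left', left_on='director_id', right_on='id')
# Drop the join keys that are no longer needed
data.drop(['director_id', 'id_y'], axis=1, inplace=True)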

Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 81.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 62.7MB/s]


#### How do we assess the budget of a movie relative to its director?

For each director, we can subtract the director's average `budget` from the `budget` of each of their movies.

Let's first find the average budget of each director.



In [None]:
data.groupby(['director_name'])['budget'].mean()

director_name
Adam McKay                     5.691667e+07
Adam Shankman                  4.837500e+07
Alejandro González Iñárritu    3.333333e+07
Alex Proyas                    7.040000e+07
Alexander Payne                1.560000e+07
                                   ...     
Wes Craven                     2.338000e+07
Wolfgang Petersen              9.014286e+07
Woody Allen                    1.177778e+07
Zack Snyder                    1.228571e+08
Zhang Yimou                    2.083335e+07
Name: budget, Length: 199, dtype: float64

Now we can subtract this average from the budget of each of that director's movies.

- We can use a function for this transformation

This is known as a **group-based transformation**.
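To see how a group-based transformation differs from the aggregation used for the averages, here is a minimal sketch on a made-up toy frame. The `toy` DataFrame and its values are assumptions for illustration, not the IMDB data. An aggregation returns one value per group, while `transform` returns one value per original row, aligned to the original index.

```python
import pandas as pd

# Hypothetical toy data, not the IMDB dataset
toy = pd.DataFrame({
    "director_name": ["A", "A", "B", "B"],
    "budget": [10.0, 30.0, 50.0, 70.0],
})

grouped = toy.groupby("director_name")["budget"]

# Aggregation: one value per director
per_director = grouped.mean()

# Transformation: the group mean repeated on every row of that group
per_row = grouped.transform("mean")

print(len(per_director), len(per_row))
```

Because `per_row` keeps the index of `toy`, it can be combined directly with the original columns.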

In [None]:
def sub_avg(x):
  x["budget"] -= x["budget"].mean()
  
data.groupby(['director_name']).transform(sub_avg)

KeyError: ignored

But the `budget` column is present in our data, so why do we get a `KeyError`?

#### Does transform expect us to provide a column?


In [None]:
def inspect(x):  # Show what transform() passes to our function
  print(x)
  print(type(x))
  raise  # bare raise stops execution after the first call (RuntimeError)

data.groupby(['director_name']).transform(inspect)

176    43882
323    44151
366    44236
505    44503
839    45301
916    45443
Name: id_x, dtype: int64
<class 'pandas.core.series.Series'>


RuntimeError: ignored

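Notice,

the data type of `x` is a pandas Series (here, the `id_x` column of the first group), not a DataFrame.

Hence `transform()` hands our function one column at a time, so the function cannot look up another column by name, such as `x["budget"]`. That is why we got the `KeyError`.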
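Strictly speaking, `transform()` can still process several columns in one call, as long as the function treats each column independently. The sketch below illustrates this on a made-up `toy` frame. The `gross` column, the values, and the name `demean` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({
    "director_name": ["A", "A", "B", "B"],
    "budget": [10.0, 30.0, 50.0, 70.0],
    "gross": [1.0, 3.0, 5.0, 7.0],
})

def demean(col):
    # Uses only operations that make sense on a single column,
    # so it never needs to index another column by name
    return col - col.mean()

# Applied to every non-grouping column, group by group
result = toy.groupby("director_name").transform(demean)
print(result)
```

The grouping column itself is not part of `result`; only `budget` and `gross` are transformed.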

#### How can we transform the `budget` column for each director, then?

In [None]:
def sub_avg(x):
  # x is the budget Series of one director; return a new Series
  # instead of modifying x in place
  return x - x.mean()

data.groupby(['director_name'])["budget"].transform(sub_avg)

0       130.300000
1       141.857143
2       150.142857
3       124.375000
4       174.004545
           ...    
1460    -47.478947
1461    -11.976667
1462    -21.700000
1463    -10.890909
1464    -31.168750
Name: budget, Length: 1465, dtype: float64

Notice,

For movies whose budget is **higher** than their director's average, the result is **positive**; for movies below that average, it is negative.
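The same per-director deviation can also be computed without a custom function, by subtracting `transform("mean")` from the column. This is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions), and it also makes the sign interpretation explicit.

```python
import pandas as pd

# Hypothetical toy data, not the IMDB dataset
toy = pd.DataFrame({
    "director_name": ["A", "A", "B", "B", "B"],
    "budget": [10.0, 30.0, 40.0, 50.0, 90.0],
})

# Director's average budget, repeated on each of that director's rows
group_mean = toy.groupby("director_name")["budget"].transform("mean")

# Deviation of each movie from its director's average
deviation = toy["budget"] - group_mean

# True where a movie cost more than its director's average
above_average = deviation > 0
print(toy.assign(deviation=deviation, above_average=above_average))
```

Within each director, the deviations sum to zero, which is a quick sanity check that the subtraction used the right group mean.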

