<b>Detect transaction outliers</b>

    You have a dataset
    You have a demonstration of preliminary data cleaning (lab2.1.ipynb, lab2.2.ipynb)
    You have the pickled card_lbe and doc_lbe (pipeline in lab2.lbe.ipynb), so that everyone uses the same mapping of id_card and id_doc

<b>Task</b>

Use characteristics of quantitative data and outlier-detection techniques to find inadequate transactions, and collect the id_card values (mapped with card_lbe) that you think do not belong to real customers.

Create a pull request named in the format hw1.1;lastname;firstname

In [1]:
import pandas as pd
import os, gc, sys, pickle, bz2
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import IPython

Outliers could be defined as follows:

* too many transactions per card, say, more than 90 per month
    * A real person is expected to shop 1-2 times per day; cards with far more activity likely belong to compulsive shoppers or to organizations rather than individuals.

* too large an amount per bill, say, more than 10000 rubles
    * Few ordinary customers spend that much on a single purchase.
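
The two fixed thresholds above can be sketched in pandas as follows. This is a minimal illustration on made-up data, not the method used later in this notebook: `id_card_int` and `sum` follow the notebook's column names, while `id_doc_int` is a hypothetical bill identifier, and the transaction limit is lowered from 90 to 2 so that the toy data triggers it.

```python
import pandas as pd

# Toy transactions. `id_doc_int` is a hypothetical bill id used only here.
tx = pd.DataFrame({
    "id_card_int": [1, 1, 1, 2, 2, 2, 3],
    "id_doc_int":  [10, 10, 11, 20, 21, 22, 30],
    "sum":         [6000.0, 5000.0, 300.0, 150.0, 90.0, 40.0, 700.0],
})

MAX_BILLS_PER_MONTH = 2   # the text suggests 90; lowered for the toy data
MAX_BILL_RUBLES = 10_000  # per-bill limit from the text

# Rule 1: number of distinct bills per card within the month.
bills_per_card = tx.groupby("id_card_int")["id_doc_int"].nunique()
too_frequent = set(bills_per_card[bills_per_card > MAX_BILLS_PER_MONTH].index)

# Rule 2: total amount per bill, then the cards that own an oversized bill.
bill_totals = tx.groupby(["id_card_int", "id_doc_int"])["sum"].sum()
oversized = bill_totals[bill_totals > MAX_BILL_RUBLES]
too_expensive = set(oversized.index.get_level_values("id_card_int"))

# A card is suspicious if it breaks either rule.
suspicious_cards = sorted(int(c) for c in too_frequent | too_expensive)
print(suspicious_cards)
```

Card 1 is flagged because bill 10 totals 11000 rubles, and card 2 because it has three bills; card 3 passes both rules.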

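Let's define outliers via the idea of confidence intervals (the three-sigma rule): every value outside the interval $[\max(0, \mu - 3\sigma), \mu + 3\sigma]$ is an outlier. Sums and counts cannot be negative, so the lower bound is clipped at 0 and in practice only the upper bound $\mu + 3\sigma$ matters.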
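
To make the arithmetic of this interval concrete, here is a small sketch using only the standard library. The monthly totals are invented for illustration; `statistics.stdev` computes the sample standard deviation, which is also what pandas' `.std()` does by default.

```python
from statistics import mean, stdev

# Hypothetical monthly totals per card, in rubles; the last one is extreme.
monthly_totals = [3000.0] * 10 + [4000.0] * 10 + [200000.0]

mu = mean(monthly_totals)
sigma = stdev(monthly_totals)  # sample standard deviation (ddof=1)

lower = max(0.0, mu - 3 * sigma)  # amounts cannot be negative, so clip at 0
upper = mu + 3 * sigma

outliers = [x for x in monthly_totals if x < lower or x > upper]
print(f"interval: [{lower:.2f}, {upper:.2f}]; outliers: {outliers}")
```

Note that the extreme value itself inflates both the mean and the standard deviation, so the rule only catches values that stand far apart from the rest of the sample.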

In [3]:
def return_expected_outliers(file_path: str) -> pd.DataFrame:
    with bz2.open(file_path, 'rb') as f:
        month_df = pickle.load(f).drop(columns=['id_card', 'is_green', 'id_doc']) 
        month_df['date'] = month_df['date'].dt.date
    
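    # 1. By total amount per card (id_card_int) over the whole monthly file.
    # Select 'sum' explicitly so the non-numeric 'date' column is not aggregated
    # (recent pandas versions raise an error when asked to sum date objects).
    # NOTE: the message printed below says "per day", but the totals are per card per month.
    gr_month_df = month_df.groupby(by=['id_card_int'])[['sum']].sum()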
    gr_month_df = gr_month_df.sort_values(by=['sum'], ascending=False)

    mean, std = gr_month_df['sum'].mean(), gr_month_df['sum'].std()
    print(f'''Threshold sum per card per day: {mean + 3*std:.2f} rubles. Mean sum: {mean:.2f}.
    Amount of outliers so far: {len(gr_month_df[gr_month_df["sum"] > mean + 3*std])}.''')
    outliers_ids = gr_month_df[gr_month_df['sum'] > mean + 3*std].index.tolist()


    # 2. By count of transactions per card (id_card_int): how many operations occurred per month?
    cnt_month_df = month_df.groupby(by=['id_card_int'])[['sum']].count()
    cnt_thresh = cnt_month_df['sum'].mean() + 3 * cnt_month_df['sum'].std()
    print(f'''Thresold for count: {cnt_thresh}; average amount of operations: {cnt_month_df["sum"].mean()}.
    Amount of outliers so far: {len(cnt_month_df[cnt_month_df["sum"]>cnt_thresh])}.
    ''')
    outliers_ids += cnt_month_df[cnt_month_df['sum'] > cnt_thresh].index.tolist()

    # 3. Free the memory by dropping references to the large intermediate frames
    del month_df, gr_month_df, cnt_month_df

    # 4. Return data
    outliers_ids = sorted(set(outliers_ids))  # deduplicate ids flagged by both rules
    outliers_df = pd.DataFrame(outliers_ids, columns=['id_card_int'])
    print(f"Returned DataFrame with length {len(outliers_df)}")
    
    return outliers_df

## Cleaned BZ2 archives (thanks to Vladislav, these DataFrames already contain card_lbe)

In [4]:
! ls *prepared*.bz2

09_prepared.pkl.bz2  10_prepared.pkl.bz2  11_prepared.pkl.bz2


In [5]:
sep_df = return_expected_outliers('09_prepared.pkl.bz2')

Threshold sum per card per day: 53064.52 rubles. Mean sum: 3562.26.
    Amount of outliers so far: 390.
Thresold for count: 536.4578857920889; average amount of operations: 31.000072896548016.
    Amount of outliers so far: 35.
    
Returned DataFrame with length 401


In [6]:
oct_df = return_expected_outliers('10_prepared.pkl.bz2')

Threshold sum per card per day: 57318.34 rubles. Mean sum: 3888.71.
    Amount of outliers so far: 376.
Thresold for count: 577.6142678863583; average amount of operations: 33.81559956626326.
    Amount of outliers so far: 52.
    
Returned DataFrame with length 389


In [7]:
nov_df = return_expected_outliers('11_prepared.pkl.bz2')

Threshold sum per card per day: 64112.61 rubles. Mean sum: 4161.58.
    Amount of outliers so far: 214.
Thresold for count: 643.5709774036437; average amount of operations: 34.81959219924545.
    Amount of outliers so far: 12.
    
Returned DataFrame with length 214


In [14]:
res_df = pd.concat([sep_df, oct_df, nov_df]).drop_duplicates()

with bz2.open('outliers_df.pkl.bz2', 'wb') as f:
    pickle.dump(res_df, f, protocol=4)
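
As a sanity check on the saved file, the merge and the pickle round trip can be sketched as below. The three small frames are made-up stand-ins for `sep_df`, `oct_df` and `nov_df`, the file is written to a temporary directory, and `reset_index(drop=True)` is an optional addition, since `pd.concat` otherwise keeps the repeating per-month index.

```python
import bz2
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-ins for the three monthly results; the ids are invented.
sep_df = pd.DataFrame({"id_card_int": [1, 2, 3]})
oct_df = pd.DataFrame({"id_card_int": [2, 3, 4]})
nov_df = pd.DataFrame({"id_card_int": [4, 5]})

# Cards flagged in several months appear only once after drop_duplicates().
res_df = (
    pd.concat([sep_df, oct_df, nov_df])
    .drop_duplicates()
    .reset_index(drop=True)
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "outliers_df.pkl.bz2"
    # Write the compressed pickle the same way as the notebook does.
    with bz2.open(path, "wb") as f:
        pickle.dump(res_df, f, protocol=4)
    # Read it back to confirm the file is loadable and unchanged.
    with bz2.open(path, "rb") as f:
        loaded = pickle.load(f)

ids = loaded["id_card_int"].tolist()
print(ids)  # [1, 2, 3, 4, 5]
```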