# Fraud Detection Project

## Previous Notebooks

- [EDA](1-EDA.ipynb)
- [Network Analysis](2.1-Network.ipynb)

In [1]:
import numpy as np
import pandas as pd
import networkx as nx

In this notebook I propagate the PageRank scores calculated for the entities involved in a claim to all the lawyers involved in that claim, and then use these lawyer scores to calculate an overall score for each claim.

To propagate the scores I build a network of all lawyers and entities, using claims as links, and run PageRank on the resulting graph. The previously calculated entity scores are used as the personalization (teleport) weights, with 0 assigned to the lawyers, so that any score a lawyer ends up with comes from the entities they are linked to.
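
The idea can be sketched on a toy graph with the built-in `nx.pagerank`. This is only an illustration, not the implementation used below: the node names, the prior values and the variable names `toy`, `prior` and `toy_scores` are all made up.

```python
import networkx as nx

# Hypothetical toy data: each tuple is one claim linking an entity to a lawyer.
edges = [
    ("entity_a", "lawyer_x"),
    ("entity_a", "lawyer_x"),  # a second claim between the same pair
    ("entity_b", "lawyer_x"),
    ("entity_b", "lawyer_y"),
]
toy = nx.MultiGraph()
toy.add_edges_from(edges)

# Prior scores for the entities only; lawyers carry no prior weight.
prior = {"entity_a": 0.3, "entity_b": 0.1, "lawyer_x": 0.0, "lawyer_y": 0.0}

# nx.pagerank rescales the personalization values so that they sum to 1.
toy_scores = nx.pagerank(toy, alpha=0.9, personalization=prior)
for node, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.4f}")
```

Although both lawyers start with a weight of 0, they end up with a positive score, and `lawyer_x`, who shares more claims with the higher-scored entity, ranks above `lawyer_y`.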

In [2]:
network = pd.read_csv('../data/raw/network_lawyers.csv', sep=';')

In [3]:
nw = nx.MultiGraph()
nw.add_edges_from(network[['id_lt', 'id_rt']].values);

In [4]:
print('Number of nodes: {}\nNumber of edges: {}\nNumber of connected components: {}'
      .format(nw.number_of_nodes(), nw.number_of_edges(), nx.number_connected_components(nw)))

Number of nodes: 62523
Number of edges: 98945
Number of connected components: 8652


In [5]:
node_scores = pd.read_pickle('../data/interim/node_scores.pkl')

In [6]:
weights_df = pd.DataFrame(list(nw.nodes())).merge(node_scores, how='left', left_on=0, right_index=True)\
                .drop('weights', axis=1).rename(columns={0:'id'})

In [7]:
weights_df['pagerank'].sum()

0.2930087487668862

The weights of the nodes in this graph do not sum to 1, so I normalize them, as PageRank requires:

In [8]:
weights = (weights_df['pagerank'] / weights_df['pagerank'].sum()).fillna(0).values
weights.sum()

1.0000000000005436
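
The order of operations in the cell above matters: nodes without a previous score (the lawyers) are `NaN` after the left merge, and they are set to 0 only after the rescaling. A small sketch with made-up values (`toy_scores` and `toy_weights` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical scores: two entities with a prior PageRank, two lawyers without one.
toy_scores = pd.Series(
    [0.2, 0.1, np.nan, np.nan],
    index=["entity_a", "entity_b", "lawyer_x", "lawyer_y"],
)

# Series.sum() skips NaN by default, so the total only counts the entities.
total = toy_scores.sum()

# Rescale first, then give the unscored lawyers a weight of 0.
toy_weights = (toy_scores / total).fillna(0).values
print(toy_weights)
print(toy_weights.sum())
```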

## PageRank

In [9]:
from scipy import sparse
from sklearn.preprocessing import normalize

In [10]:
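# adjacency matrix with a count of 1 per edge (parallel edges add up)
adj = nx.adjacency_matrix(nw, weight=None)
# L1-normalize each row so that it holds transition probabilities
M = normalize(adj, norm='l1', axis=1)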
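
To see what the row normalization does, here is a dense NumPy sketch on a hypothetical 3-node graph (the matrix `A` and the names `M_toy`, `r0`, `r1` are illustrative only). Dividing each row by its sum has the same effect as `normalize(..., norm='l1', axis=1)` on a matrix with non-negative entries and no empty rows.

```python
import numpy as np

# Hypothetical undirected graph: node 0 has two parallel edges to node 1
# and a single edge to node 2.
A = np.array([
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])

# Dividing each row by its sum turns edge counts into transition probabilities.
M_toy = A / A.sum(axis=1, keepdims=True)

# One propagation step: every node splits its current score among its neighbours.
r0 = np.full((3, 1), 1 / 3)
r1 = M_toy.T @ r0

print(M_toy.sum(axis=1))  # every row sums to 1
print(r1.ravel(), r1.sum())
```

Because every row of `M_toy` sums to 1, a propagation step only moves score between nodes and the total stays at 1.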

In [11]:
v = weights.reshape(-1, 1)

In [12]:
# initializing the rank with 1/number of nodes
r = np.ones(len(v)) / len(v)
r = r.reshape(-1, 1)

In [13]:
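eps = 1e-9  # convergence threshold on the L2 norm of the update
d = 0.9     # damping factor
r_prev = r
r = d * M.T.dot(r_prev) + (1-d) * v
# iterate until two successive rank vectors differ by at most eps
while np.linalg.norm(r - r_prev) > eps:
    r_prev = r
    r = d * M.T.dot(r_prev) + (1-d) * v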
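
The same iteration can be wrapped in a reusable function. The version below is a sketch rather than the code used in this notebook: the function name `personalized_pagerank` and the `max_iter` safety cap are my additions, and the toy matrix is made up.

```python
import numpy as np

def personalized_pagerank(M, v, d=0.9, eps=1e-9, max_iter=1000):
    """Iterate r = d * M.T @ r + (1 - d) * v until the update is at most eps.

    M is a row-stochastic transition matrix and v a teleport vector summing to 1.
    max_iter is an added safety cap that the loop above does not have.
    """
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    r = np.full_like(v, 1 / len(v))  # uniform starting rank
    for _ in range(max_iter):
        r_next = d * (M.T @ r) + (1 - d) * v
        if np.linalg.norm(r_next - r) <= eps:
            return r_next
        r = r_next
    raise RuntimeError("power iteration did not converge")

# Hypothetical path graph 0 - 1 - 2 where only node 0 carries a prior weight.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
M_toy = A / A.sum(axis=1, keepdims=True)
v_toy = np.array([1.0, 0.0, 0.0])
r_toy = personalized_pagerank(M_toy, v_toy)
print(r_toy.ravel())
```

Nodes 1 and 2 receive a positive rank even though their teleport weight is 0, which is how the lawyers acquire a score from the entities.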

In [14]:
node_scores = pd.DataFrame(np.hstack([v, r]), index=list(weights_df['id']))
node_scores.rename(columns={0: 'weights', 1:'pagerank'}, inplace=True)

In [15]:
node_scores.loc[network['id_rt'].unique()].sort_values(by='pagerank', ascending=False).head(25)

| id | weights | pagerank |
|---|---|---|
| 195922 | 0.0 | 0.008453 |
| 197116 | 0.0 | 0.002025 |
| 194163 | 0.0 | 0.001858 |
| 193391 | 0.0 | 0.001026 |
| 196018 | 0.0 | 0.000982 |
| 194995 | 0.0 | 0.000827 |
| 207774 | 0.0 | 0.000786 |
| 200277 | 0.0 | 0.000775 |
| 195022 | 0.0 | 0.000742 |
| 199514 | 0.0 | 0.00072 |


## Scoring the Claims

As before, I score each claim with the mean of the scores of all the lawyers involved.

In [16]:
network = network.merge(node_scores, how='left', left_on='id_rt', right_index=True)\
                .drop('weights', axis=1).rename(columns={'pagerank':'lawyer_pagerank'})

In [17]:
network['lawyer_score'] = network['lawyer_pagerank']

In [18]:
nw_scores = network.groupby('claim_code')['lawyer_score'].mean().reset_index()
nw_scores.to_pickle('../data/interim/lawyers_network_scores.pkl')
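
A toy example of the aggregation, with made-up claim codes and scores (`toy_network` and `toy_claim_scores` are hypothetical names):

```python
import pandas as pd

# Hypothetical rows: one row per (claim, lawyer) link, with the lawyer's score.
toy_network = pd.DataFrame({
    "claim_code": ["c1", "c1", "c2"],
    "id_rt": ["lawyer_x", "lawyer_y", "lawyer_x"],
    "lawyer_score": [0.4, 0.2, 0.4],
})

# The claim score is the mean over all rows that share the same claim code.
toy_claim_scores = (
    toy_network.groupby("claim_code")["lawyer_score"].mean().reset_index()
)
print(toy_claim_scores)
```

Note that the mean is taken over rows, so a lawyer who appears on several rows of the same claim is counted once per row.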

## Following Notebooks

- [Witnesses' Network Analysis](2.3-Network-Witnesses.ipynb)
- [Dataset Creation](3-Input_dataset_creation.ipynb)
- [Random Forest Prediction](4-Model.ipynb)