# Introduction

This notebook illustrates how to build a recommendation system for GitHub projects. Web resource parsing and a collaborative crawler are used to collect the data. Two recommendation algorithms (`User-based collaborative filtering` and `GCMC`) are then used to infer the preferences of a specific user.

We'll give an overview of the whole implementation and evaluate its performance. For more details, please refer to <a href="https://github.com/YuDongPan/Github_Recommendation">https://github.com/YuDongPan/Github_Recommendation</a>

# Preparation

## Configuration 

- Set up a virtual environment with Python 3.8 or newer
- Install required dependencies

In [1]:
!pip install -r Resource/requirements1.txt
!pip install -r Resource/requirements2.txt

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


## Prepare a token list CSV file

We need a list of GitHub access tokens, which lets the crawler work around the rate limit of GitHub's public API.

In [2]:
import pandas as pd

token_csv = pd.read_csv('./Resource/tokens.csv')
token_csv

Unnamed: 0,name,token
0,YuDong Pan,ghp_<redacted>
1,YuDong Pan,ghp_<redacted>
2,YuDong Pan,ghp_<redacted>
3,YuDong Pan,ghp_<redacted>
4,YuDong Pan,ghp_<redacted>
5,YuDong Pan,ghp_<redacted>
6,YuDong Pan,ghp_<redacted>
7,YuDong Pan,ghp_<redacted>
8,YuDong Pan,ghp_<redacted>


# Data Collection

Web resource parsing and a collaborative crawler are used to collect the data. The whole process is divided into three parts:

* **UserInfo Collection**: We collect user information (`username` and `homepage`) from the following list of the top 30 most popular projects on the GitHub platform.
* **UserStar Collection**: We parse each user's homepage to obtain the number of repos the user owns and the number of projects the user has starred.
* **UserProject Collection**:
   - Call GitHub's public API with the username to obtain the user's starred projects:
    `https://api.github.com/users/username/starred`
   - Use `aiohttp` to speed up the process.
   - Use the `token list` together with a `random function` to get around the limit on the number of API requests.
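
The UserProject step can be sketched roughly as follows. This is a minimal illustration, not the code in `get_user_repo.py`: the helper names (`starred_url`, `auth_headers`, `fetch_starred_page`, `fetch_many`) are hypothetical, and it assumes the "random function" simply picks one token from the list for each request.

```python
import asyncio
import random

API_ROOT = "https://api.github.com"


def starred_url(username):
    """Build the endpoint that lists the projects starred by `username`."""
    return f"{API_ROOT}/users/{username}/starred"


def auth_headers(tokens, rng=random):
    """Pick one token at random so requests are spread over the token list (assumed strategy)."""
    token = rng.choice(tokens)
    return {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }


async def fetch_starred_page(session, username, tokens, page=1, per_page=100):
    """Fetch one page of starred projects and return their full names."""
    params = {"page": str(page), "per_page": str(per_page)}
    async with session.get(
        starred_url(username), headers=auth_headers(tokens), params=params
    ) as resp:
        resp.raise_for_status()
        repos = await resp.json()
    return [repo["full_name"] for repo in repos]


async def fetch_many(usernames, tokens):
    """Fetch the first page for several users concurrently."""
    import aiohttp  # third-party; imported here so the helpers above work without it

    async with aiohttp.ClientSession() as session:
        tasks = [fetch_starred_page(session, name, tokens) for name in usernames]
        return await asyncio.gather(*tasks)
```

In a script this could be driven with `asyncio.run(fetch_many(usernames, tokens))`, where `tokens` is the `token` column of the CSV file loaded earlier; inside a notebook cell, `await fetch_many(usernames, tokens)` is the usual form.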

The collection is carried out by four Python scripts, which should be executed in order:
1. **get_user_info.py**
2. **get_user_star.py**
3. **filter_user.py**
4. **get_user_repo.py**

In [1]:
%run Reptile/get_user_info.py
%run Reptile/get_user_star.py
%run Reptile/github_users/filter_user.py
%run Reptile/get_user_repo.py

D:\Jupyter\Github_recommendation_v2\Reptile


## Data Description
* Each record contains five fields: the user name, the project name (full name), the number of stars and forks of the project, and whether the user has starred the project. The data is organized into CSV files as follows.

| user | project | star | fork | has_star |
| ---- | ---- | ---- |---- |---- |

* Based on different requirements, we provide three sizes of data folder for users to process: `tiny`, `small`, `large`. Each data folder includes three types of CSV files:
    - `users`: User information table, containing the mapping between index and username
    - `projects`: Project information table, using three fields ('name', 'star', 'fork') to describe projects
    - `data`: Correlation information between users and projects. In this project, the field 'has_star' represents the relationship. The tiny dataset includes 2105 users, 4761 projects, and 311305 records in total. The small dataset includes 3000 users, 182404 projects, and 929489 records in total. The large dataset includes 70129 users, 271530 projects, and 21775242 records in total.
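
To get a feel for how sparse these datasets are, the counts above can be turned into a density figure. The sketch below assumes every record is a distinct (user, project) pair with `has_star = 1`; the `DATASETS` table and the `density` helper are introduced here only for illustration.

```python
# (users, projects, records) as reported in the dataset description
DATASETS = {
    "tiny": (2105, 4761, 311305),
    "small": (3000, 182404, 929489),
    "large": (70129, 271530, 21775242),
}


def density(num_users, num_projects, num_records):
    """Share of all possible user-project pairs that are actually recorded."""
    return num_records / (num_users * num_projects)


for name, (users, projects, records) in DATASETS.items():
    print(f"{name:>5}: density = {density(users, projects, records):.4%}")
```

Under that assumption the tiny dataset is by far the densest, while the small and large datasets leave well under 1% of the possible pairs filled.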

In [2]:
import pandas as pd

data_large = pd.read_csv('./data/large/raw/data.csv')
data_large

Unnamed: 0,user,project,has_star
0,0,143595,1
1,0,98287,1
2,0,154934,1
3,0,40917,1
4,0,75375,1
...,...,...,...
21775237,70128,2100,1
21775238,70128,224532,1
21775239,70128,265154,1
21775240,70128,86429,1


# Recommendation algorithm design
## User-based Collaborative Filtering
 1. We build a `similarity matrix` of users based on the projects they have starred.
 2. For each target user, we find the top N `similar users`.
 3. We recommend the top K projects starred by these similar users.
 4. Every recommended project must be one that the target user has not seen before.

![image](Image/UbCF.png)
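
The four steps above can be sketched in plain Python. This is a simplified illustration rather than the code behind `test_Github_UbCF.py`: it assumes cosine similarity over the sets of starred projects, and the function names `user_similarity` and `recommend` are hypothetical.

```python
import math
from collections import defaultdict


def user_similarity(stars):
    """stars maps user -> set of starred projects; returns user -> {other user: cosine similarity}."""
    # Inverted index: project -> users who starred it
    item_users = defaultdict(set)
    for user, projects in stars.items():
        for project in projects:
            item_users[project].add(user)

    # Count how many projects each pair of users has starred in common
    co_starred = defaultdict(lambda: defaultdict(int))
    for users in item_users.values():
        for u in users:
            for v in users:
                if u != v:
                    co_starred[u][v] += 1

    # Normalise the counts into cosine similarities (assumed similarity measure)
    return {
        u: {v: c / math.sqrt(len(stars[u]) * len(stars[v])) for v, c in related.items()}
        for u, related in co_starred.items()
    }


def recommend(target, stars, sim, n_similar=20, top_k=10):
    """Recommend up to top_k unseen projects, scored by the n_similar most similar users."""
    seen = stars[target]
    neighbours = sorted(sim.get(target, {}).items(), key=lambda kv: kv[1], reverse=True)
    scores = defaultdict(float)
    for other, weight in neighbours[:n_similar]:
        for project in stars[other]:
            if project not in seen:  # step 4: skip projects the user already knows
                scores[project] += weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [project for project, _ in ranked[:top_k]]
```

For example, with `stars = {"alice": {"a", "b", "c"}, "bob": {"a", "b", "d"}, "carol": {"c", "e"}}`, calling `recommend("alice", stars, user_similarity(stars))` ranks `d` ahead of `e`, because bob shares two projects with alice while carol shares only one.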

## GC-MC (Graph Convolutional Matrix Completion, Berg et al. KDD 2018)
1. We treat the recommendation task as a `link prediction` problem.
2. Since the original dataset contains only positive edges, we use `negative sampling` to draw as many negative edges as there are positive edges.
3. The problem thus reduces to a `binary classification` problem.
4. After training, the model is used to calculate the probability that the target user would star each project.
5. We select the top K projects with the highest probability.
6. Every recommended project must be one that the target user has not seen before.

![image](Image/GCMC.jpg)
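
Steps 2, 5 and 6 can be illustrated without the model itself. The sketch below is an assumption about how the sampling and ranking might look, not the repository's implementation: it draws negatives uniformly at random by rejection sampling, and the names `sample_negative_edges` and `top_k_unseen` are hypothetical.

```python
import random


def sample_negative_edges(positive_edges, num_users, num_items, seed=0):
    """Draw as many (user, item) pairs without a star as there are positive edges."""
    rng = random.Random(seed)
    positives = set(positive_edges)
    if 2 * len(positives) > num_users * num_items:
        raise ValueError("not enough unconnected pairs to match the positive edges")
    negatives = set()
    while len(negatives) < len(positives):
        edge = (rng.randrange(num_users), rng.randrange(num_items))
        if edge not in positives:  # reject pairs that are real stars
            negatives.add(edge)
    return sorted(negatives)


def top_k_unseen(scores, seen, k):
    """scores maps item -> predicted probability; return the k best items not in `seen`."""
    candidates = [(item, score) for item, score in scores.items() if item not in seen]
    candidates.sort(key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in candidates[:k]]
```

Positive edges are then labelled 1 and sampled edges 0. Because the two classes are balanced, an accuracy of 0.5 corresponds to random guessing.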

# Evaluation

## Metrics

- `precision`: The fraction of the projects in the recommendation list that the user actually likes.

- `recall`: The fraction of all projects the user likes that appear in the recommendation list.

- `coverage`: The ability of the recommendation system to surface long-tail items, typically measured as the share of all projects that appear in at least one recommendation list.

- `popularity`: The average popularity of the recommended projects, which indicates how strongly the recommendations are biased toward popular projects.
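
One common way to compute these four metrics for top-K recommendation is sketched below. The exact formulas used by the test scripts are not shown in this notebook, so the definitions here are assumptions: coverage is taken over the projects in the training set, and popularity is the mean of `log(1 + number of training stars)` over recommended projects. The function name `evaluate` is hypothetical.

```python
import math
from collections import Counter


def evaluate(recommendations, train, test):
    """recommendations: user -> list of projects; train/test: user -> set of starred projects."""
    # Popularity of a project = number of users who starred it in the training set
    item_popularity = Counter(p for projects in train.values() for p in projects)

    hits = rec_count = test_count = 0
    recommended_items = set()
    popularity_sum = 0.0
    for user, recs in recommendations.items():
        liked = test.get(user, set())
        for project in recs:
            if project in liked:
                hits += 1
            recommended_items.add(project)
            popularity_sum += math.log(1 + item_popularity[project])
        rec_count += len(recs)
        test_count += len(liked)

    return {
        "precision": hits / rec_count if rec_count else 0.0,
        "recall": hits / test_count if test_count else 0.0,
        "coverage": len(recommended_items) / len(item_popularity) if item_popularity else 0.0,
        "popularity": popularity_sum / rec_count if rec_count else 0.0,
    }
```

A high precision combined with low coverage and high popularity would suggest that the recommender mostly repeats a small set of well-known projects.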

In [1]:
%cd Test
%run test_Github_UbCF.py

D:\Jupyter\Github_recommendation_v2\Test
similar user number 20
recommended item number 10
loading ../data/small/raw/data.csv, 0
loading ../data/small/raw/data.csv, 100000
loading ../data/small/raw/data.csv, 200000
loading ../data/small/raw/data.csv, 300000
loading ../data/small/raw/data.csv, 400000
loading ../data/small/raw/data.csv, 500000
loading ../data/small/raw/data.csv, 600000
loading ../data/small/raw/data.csv, 700000
loading ../data/small/raw/data.csv, 800000
loading ../data/small/raw/data.csv, 900000
load../data/small/raw/data.csv success
split training dataset and test dataset success
length of training dataset: 464429
length of test dataset: 465061
building item-users inverse table...
build item-users inverse table success
total item number 132376
building user co-related item matrix
build user co-related item matrix success
calculating user similarity matrix...
calculating user similarity factor 200000
calculating user similarity factor 400000
calculating user similarity f

recommended projects for user GeorgeShao: ['vuejs/vue', 'trekhleb/javascript-algorithms', 'sindresorhus/awesome', 'atlassian/react-beautiful-dnd', 'gatsbyjs/gatsby', 'electron/electron', 'godotengine/godot', 'jwasham/coding-interview-university', 'gothinkster/realworld', 'python-poetry/poetry']
recommended projects for user NELSONZHAO: ['donnemartin/data-science-ipython-notebooks', 'pallets/flask', 'google-research/bert', 'cybertronai/gradient-checkpointing', 'eriklindernoren/ML-From-Scratch', 'ageitgey/face_recognition', 'tensorflow/tensor2tensor', 'RaRe-Technologies/gensim', 'chiphuyen/stanford-tensorflow-tutorials', 'ray-project/ray']
recommended projects for user aishoot: ['facebookresearch/fairseq', 'donnemartin/system-design-primer', 'oarriaga/face_classification', 'JaidedAI/EasyOCR', 'Megvii-BaseDetection/YOLOX', 'jwasham/coding-interview-university', 'faif/python-patterns', 'igormq/ctc_tensorflow_example', 'PyCQA/isort', 'ZuzooVn/machine-learning-for-software-engineers']
recomm

In [2]:
%run test_Github_GCMC.py

data: Data(x=[6868], edge_index=[2, 435692], edge_type=[435692], train_idx=[217846], test_idx=[93458], train_gt=[217846], test_gt=[93458], num_users=[1], num_items=[1], num_user_items=[2105])
config: {'epochs': 20, 'lr': 0.001, 'weight_decay': 0, 'drop_prob': 0.7, 'topK': 50, 'accum': 'split_stack', 'hidden_size': [500, 75], 'num_basis': 2, 'rgc_bn': True, 'rgc_relu': True, 'dense_bn': True, 'dense_relu': True, 'bidec_drop': False, 'root': '../data/tiny', 'dataset_name': 'tiny', 'gpu_id': -1}


[ Epoch:    0/20 | Loss: 0.787763 | Train ACC: 0.499445 | Test ACC: 0.500000
[ Epoch:    1/20 | Loss: 1.832993 | Train ACC: 0.500002 | Test ACC: 0.502862
[ Epoch:    2/20 | Loss: 0.815580 | Train ACC: 0.505467 | Test ACC: 0.499989
[ Epoch:    3/20 | Loss: 0.950476 | Train ACC: 0.500122 | Test ACC: 0.499781
[ Epoch:    4/20 | Loss: 0.866569 | Train ACC: 0.500046 | Test ACC: 0.500241
[ Epoch:    5/20 | Loss: 0.721814 | Train ACC: 0.503124 | Test ACC: 0.500032
[ Epoch:    6/20 | Loss: 0.813131 | Train ACC: 0.501774 | Test ACC: 0.500011
[ Epoch:    7/20 | Loss: 0.747044 | Train ACC: 0.510103 | Test ACC: 0.534738
[ Epoch:    8/20 | Loss: 0.710828 | Train ACC: 0.524171 | Test ACC: 0.554736
[ Epoch:    9/20 | Loss: 0.762430 | Train ACC: 0.510012 | Test ACC: 0.580924
[ Epoch:   10/20 | Loss: 0.731887 | Train ACC: 0.518616 | Test ACC: 0.532731
[ Epoch:   11/20 | Loss: 0.693452 | Train ACC: 0.544010 | Test ACC: 0.500310
[ Epoch:   12/20 | Loss: 0.711475 | Train ACC: 0.529906 | Test ACC: 0.500027

recommended projects for user lfchi: ['raspberrypi/linux', 'apache/tvm', 'apache/incubator-mxnet', 'getredash/redash', 'MingchaoZhu/DeepLearning', 'h5bp/Effeckt.css', 'felixrieseberg/macintosh.js', 'Shopify/toxiproxy', 'apple/swift', 'mybatis/mybatis-3', 'Tencent/xLua', 'go-kratos/kratos', 'pkmital/tensorflow_tutorials', 'JanusGraph/janusgraph', 'harness/drone', 'jinfagang/tensorflow_poems', 'callstack/linaria', 'junyanz/interactive-deep-colorization', 'protocolbuffers/protobuf', 'replicate/keepsake', 'reactstrap/reactstrap', 'taye/interact.js', 'bitwarden/server', 'switchablenorms/DeepFashion2', 'tonybeltramelli/pix2code', 'golden-layout/golden-layout', 'ColorlibHQ/gentelella', 'cocos2d/cocos2d-x', 'perwendel/spark', 'spacedriveapp/spacedrive', 'halo-dev/halo', 'ArduPilot/ardupilot', 'openai/mujoco-py', 'sshuttle/sshuttle', 'istio/istio', 'Shawn-Shan/fawkes', 'rxin/db-readings', 'KaimingHe/deep-residual-networks', 'google/physical-web', 'akveo/eva-icons', 'ceph/ceph', 'dataarts/dat.gu

recommended projects for user ryanzhumich: ['facebookresearch/swav', 'AllThingsSmitty/css-protips', 'alibaba/euler', 'jisungk/deepjazz', 'OpenAtomFoundation/pika', 'lutzroeder/netron', 'espnet/espnet', 'relativty/Relativty', 'highcharts/highcharts', 'Franck-Dernoncourt/NeuroNER', 'alexanderepstein/Bash-Snippets', 'fastlane/fastlane', 'cortexlabs/cortex', 'remix-run/history', 'jessevig/bertviz', 'TykTechnologies/tyk', 'jaredpalmer/razzle', 'the-paperless-project/paperless', 'yjs/yjs', 'checkly/headless-recorder', 'keon/algorithms', 'quay/clair', 'mozilla/BrowserQuest', 'yunjey/stargan', 'uglide/RedisDesktopManager', 'Textualize/rich', 'SophonPlus/ChineseNlpCorpus', 'ryanoasis/nerd-fonts', 'b3log/baidu-netdisk-downloaderx', 'primer/octicons', 'zllrunning/face-parsing.PyTorch', 'microsoft/muzic', 'jgraph/drawio', 'BVLC/caffe', 'cs01/gdbgui', 'apache/hadoop', 'unit8co/darts', 'rlworkgroup/garage', 'qbittorrent/qBittorrent', 'mmp/pbrt-v3', 'hujie-frank/SENet', 'casbin/casbin', 'http-party/h

recommended projects for user nickgu: ['julianshapiro/velocity', 'android/architecture-samples', 'ehang-io/nps', 'shadowsocks/shadowsocks-windows', 'byoungd/English-level-up-tips', 'reiinakano/scikit-plot', 'bentrevett/pytorch-sentiment-analysis', 'bryandlee/animegan2-pytorch', 'mickael-kerjean/filestash', 'Vay-keen/Machine-learning-learning-notes', 'share/sharedb', 'microsoftarchive/redis', 'PuerkitoBio/goquery', 'Ryujinx/Ryujinx', 'ymcui/Chinese-BERT-wwm', 'The-Run-Philosophy-Organization/run', 'ventoy/Ventoy', 'brianc/node-postgres', 'ml-jku/hopfield-layers', 'aFarkas/lazysizes', 's3prl/s3prl', 'karpathy/arxiv-sanity-preserver', 'bevyengine/bevy', 'oxford-cs-deepnlp-2017/lectures', 'emilwallner/Screenshot-to-code', 'NELSONZHAO/zhihu', 'flutter/plugins', 'facebookresearch/dlrm', 'ycm-core/YouCompleteMe', 'tensorflow/minigo', 'nonstriater/Learn-Algorithms', 'realpython/python-guide', 'i3/i3', 'wagerfield/parallax', 'dexteryy/spellbook-of-modern-webdev', 'tachiyomiorg/tachiyomi', 'joer