# <center> Matching Treatment and Control Groups </center>

<center>By Jiaoping Chen </center>

<img src="https://www.thoughtco.com/thmb/FzU2MP1eoOe6ZbD7T2bsVTWatmc=/768x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/control-and-experimental-group-differences-606113-FINAL1-5b7ad7d0c9e77c00574b71b5.png" width="100%">

Image from https://www.thoughtco.com/control-and-experimental-group-differences-606113

---
# Overview

The number of papers submitted to computer science conferences is growing rapidly, especially in artificial intelligence and machine learning, yet a lack of replicability may limit the early influence of this work in the CS community. Some researchers share their code with readers, but most do not. I am therefore curious whether mentioning a code link in a paper affects the paper's impact, measured for example by its number of citations, and if so, how large the true effect is.

My goal is to estimate whether citation counts differ because a paper mentions a code link. However, endogeneity may arise if we do not account for unobserved properties that both make an author more likely to share code and boost the paper's citation count, for example the author being a big name in a specific area. Such endogeneity biases the estimate for the variable of interest. I therefore plan to apply a matching approach before modeling in order to identify the causal effect. Given a treated observation, a matching approach chooses the most similar observation from the untreated (control) group and forms a treated-control pair. We can then run regressions on the set of matched treated-control pairs, rather than on the unmatched raw data, to measure the causal effect of mentioning a code link on paper citations.
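The pairing step described above can be sketched as a greedy one-to-one nearest-neighbor match. This is only a minimal illustration, not the final algorithm for this project: the covariates `year` and `n_authors`, the Euclidean distance, and the function names are all assumptions made for the example.

```python
import math


def distance(a, b, keys):
    """Euclidean distance between two papers over the chosen covariates."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def match_nearest(treated, controls, keys):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns a list of (treated_paper, control_paper) pairs.
    """
    available = list(controls)  # copy so the caller's list is untouched
    pairs = []
    for t in treated:
        if not available:
            break  # no control candidates left to match
        best = min(available, key=lambda c: distance(t, c, keys))
        pairs.append((t, best))
        available.remove(best)  # each control is used at most once
    return pairs


# Toy data; the covariates here are hypothetical placeholders.
treated = [
    {"id": "t1", "year": 2018, "n_authors": 3},
    {"id": "t2", "year": 2015, "n_authors": 6},
]
controls = [
    {"id": "c1", "year": 2015, "n_authors": 5},
    {"id": "c2", "year": 2018, "n_authors": 4},
    {"id": "c3", "year": 2010, "n_authors": 1},
]

pairs = match_nearest(treated, controls, keys=["year", "n_authors"])
matched_ids = [(t["id"], c["id"]) for t, c in pairs]
print(matched_ids)  # [('t1', 'c2'), ('t2', 'c1')]
```

In practice the covariates would likely need to be standardized first so that no single variable dominates the distance, and greedy matching depends on the order in which treated papers are processed.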

In this project, I will first download and pre-process two large datasets (>10 GB) using Python, then engineer features, and finally apply the matching algorithm.


---
# Program Description

I have already downloaded the two datasets but have not yet pre-processed or cleaned them. This project will therefore have two main components.

- First, I will create a Python file, *"Pre-process.py"*, to generate cleaned versions of the two datasets. The first dataset provides the response variable (the number of citations a paper receives); the second provides the independent variable (whether the paper mentions a GitHub link to share its code). I will then merge the two datasets to obtain raw samples containing the variables of interest. I define the treated group as articles that mention a corresponding GitHub link, while the "candidates" for the control group are articles that do not mention code resources. (The candidate control group refers to the control group before synthetic control is applied.)


- Second, I will create *"match.py"* to apply a matching algorithm that selects control (untreated) samples for each observation in the treated group. I will then visualize the differences between the data before and after matching, along with the characteristics of each.
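As a rough sketch of the first component, the two cleaned datasets could be joined on a shared paper identifier and each paper flagged as treated when its text contains a GitHub URL. The field names (`paper_id`, `n_citations`), the regular expression, and the helper names are hypothetical; the real datasets may need a different key and a stricter link pattern.

```python
import re

# Assumption: any github.com URL in the text counts as a shared code link.
GITHUB_LINK = re.compile(r"https?://(?:www\.)?github\.com/\S+", re.IGNORECASE)


def mentions_code_link(text):
    """Return True if the paper text contains a GitHub URL."""
    return bool(GITHUB_LINK.search(text))


def build_raw_samples(citations, full_texts):
    """Inner-join the two datasets on paper id and add a treatment flag.

    citations:  dict mapping paper_id -> citation count
    full_texts: dict mapping paper_id -> paper text
    """
    samples = []
    for paper_id, n_citations in citations.items():
        text = full_texts.get(paper_id)
        if text is None:
            continue  # paper missing from the second dataset, so skip it
        samples.append(
            {
                "paper_id": paper_id,
                "n_citations": n_citations,
                "treated": mentions_code_link(text),
            }
        )
    return samples


# Toy inputs standing in for the two cleaned datasets.
citations = {"p1": 40, "p2": 3, "p3": 11}
full_texts = {
    "p1": "Code is available at https://github.com/example/repo.",
    "p2": "We do not release code.",
    "p4": "This paper has no citation record.",
}

samples = build_raw_samples(citations, full_texts)
treated_group = [s for s in samples if s["treated"]]
control_candidates = [s for s in samples if not s["treated"]]
```

Papers that appear in only one dataset are dropped here, which mirrors an inner join; whether that is the right choice depends on how much coverage the two datasets share.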

---
# Project Goals and Timeline

**Short Term** Read and clean large datasets using efficient Python code; understand the synthetic control algorithm and apply it to this real-world problem.

**Mid Term** Write the log-likelihood function in Python; maximize it using an optimization toolkit.

**Long Term** Collect more variables related to the response (a paper's citation count). For example, extract each paper's abstract and apply topic modeling, since the number of citations can depend on the type of article.

The dates below (approximately every other Friday) combine the course deliverables with my own project milestones.


- 9/11/2020 - Create git repository 
- 9/25/2020 - Proposal Due
- 10/02/2020 - Upload the data and start pre-processing the datasets
- 10/09/2020 - Stub functions and Example code integration (With documentation)
- 10/16/2020 - Finish data cleaning and obtain raw samples
- 10/23/2020 - Unit Test Integration
- 10/30/2020 - Finish reading about the main algorithms
- 11/06/2020 - Coding Standards and Linting
- 11/13/2020 - Write and debug the algorithm
- 11/20/2020 - Code Review
- 12/04/2020 - Presentation Video Due
- 12/09/2020 - Final Report and Code Due




---
# Anticipating Challenges  

I need to learn how to read very large JSON files (>10 GB) in Python, how to pre-process them efficiently, how to use GitHub to manage my code files and resources, and how to translate mathematical notation into Python code.

I expect two main challenges: 1) pre-processing the data directly in JSON format, because it is hard to build such large data frames with pandas; and 2) feature extraction.
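One possible way around the first challenge is to stream the file record by record and keep only the fields that are needed, instead of loading everything into memory at once. The sketch below assumes the data is stored in JSON Lines format (one JSON object per line), which may not be true of the actual files; a single giant JSON array would need an incremental parser instead. The field names and the `iter_records` helper are hypothetical.

```python
import io
import json


def iter_records(lines, keep_fields):
    """Yield one trimmed dict per non-empty line, one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the needed fields to reduce memory use downstream.
        yield {k: record.get(k) for k in keep_fields}


# Small in-memory stand-in for a large file; with real data this would be
# a file object such as open(path, encoding="utf-8").
sample = io.StringIO(
    '{"paper_id": "p1", "n_citations": 12, "abstract": "..."}\n'
    "\n"
    '{"paper_id": "p2", "n_citations": 0, "abstract": "..."}\n'
)

records = list(iter_records(sample, keep_fields=["paper_id", "n_citations"]))
print(records)
```

Because `iter_records` is a generator, only one line is held in memory at a time; the `list(...)` call is used here just to show the result on the toy input.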

