# Machine Learning Problem Set

## Overview

This problem set focuses on analyzing textual data from scientific papers. The primary aim is to develop an analysis pipeline that reads, pre-processes, analyzes, and visualizes this data.

## Tasks

### Data Sourcing

The first stage involves sourcing and reading the textual content of scientific papers. A few example PDF files are provided in the `data/` directory. Download additional papers of your choice so that you analyze at least 6 papers in total. The papers should cover at least 2 different fields of your choosing; these may both lie within economics or be as far apart as medicine and physics.

Use an appropriate PDF reading library or tool to extract the text programmatically. An example using `pdfminer.six` is provided in the section "Extracting text from PDF files" below, but you are free to use any Python library you prefer.

### Pre-processing

Pre-processing is a critical step aimed at cleaning and preparing the text data for analysis. In this problem set, we want to focus on sentences, so you should first tokenize the text into sentences. Further steps include:

- Removing punctuation, numbers, and special characters (e.g. by using regular expressions).
- Converting all text to a uniform case (usually lowercase), so the analysis is not case-sensitive.
- Removing stop words, i.e., commonly used words (e.g., "and", "the", "is") that contribute little to the overall meaning and can skew the analysis.
- Optionally, applying stemming or lemmatization, depending on the specific requirements and goals of the analysis.
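
The steps listed above could be sketched as follows, using only the standard library. This is a minimal illustration, not a prescribed solution: the function names, the regex-based sentence splitter, and the tiny `STOP_WORDS` set are all assumptions made for this example. The naive splitter will break on abbreviations such as "e.g.", so a dedicated sentence tokenizer may be preferable for real papers.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real analysis would use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "we"}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def clean_sentence(sentence):
    """Lowercase, keep letters only, and drop stop words."""
    letters_only = re.sub(r"[^a-z]+", " ", sentence.lower())
    tokens = [tok for tok in letters_only.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


def preprocess(text):
    """Return the non-empty cleaned sentences of a document."""
    cleaned = (clean_sentence(s) for s in split_sentences(text))
    return [s for s in cleaned if s]


sample = "We study 3 models. The results are robust! Is the effect causal?"
print(preprocess(sample))
# → ['study models', 'results robust', 'effect causal']
```

Splitting into sentences happens before punctuation is removed, because the sentence boundaries would otherwise be lost.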

### Analysis

The final stage is the analysis of the pre-processed text to extract meaningful information. Create a t-SNE visualization of the pre-processed sentences. Because t-SNE operates on numeric vectors, each sentence must first be converted into a vector representation. Label each sentence with the paper (or field) it comes from and color-code the points in the t-SNE plot accordingly. Ultimately, the plot should let you judge visually how similar the papers are in a linguistic sense.

You may pick any method you like (including the use of ChatGPT or OpenAI's API), as long as your approach is clearly documented.
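
One possible approach, shown here only as a hedged sketch, is to represent each sentence as a TF-IDF vector and project these vectors to two dimensions with scikit-learn's `TSNE`. The choice of TF-IDF, the helper names `embed_sentences` and `plot_embedding`, the perplexity heuristic, and the example sentences and labels are all assumptions made for illustration; sentence embeddings or any other vectorization would work equally well.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def embed_sentences(sentences, random_state=0):
    """Map pre-processed sentences to 2-D t-SNE coordinates via TF-IDF."""
    # Dense array for simplicity; for large corpora consider reducing
    # the dimensionality first instead of calling toarray().
    tfidf = TfidfVectorizer().fit_transform(sentences).toarray()
    # t-SNE requires perplexity < number of samples (heuristic choice).
    perplexity = min(30.0, max(1.0, (len(sentences) - 1) / 3))
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random",
                random_state=random_state)
    return tsne.fit_transform(tfidf)


def plot_embedding(coords, labels):
    """Scatter the 2-D points with one color per label (paper or field)."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots()
    for label in np.unique(labels):
        mask = labels == label
        ax.scatter(coords[mask, 0], coords[mask, 1], s=10, label=label)
    ax.legend(title="Field")
    return fig


# Hypothetical pre-processed sentences with their field of origin.
sentences = [
    "inflation expectations affect monetary policy",
    "central banks adjust interest rates",
    "labor markets respond wage shocks",
    "trade tariffs reduce consumer welfare",
    "quantum states exhibit entanglement",
    "particles interact strong force",
    "photons carry electromagnetic energy",
    "superconductors conduct without resistance",
]
labels = ["economics"] * 4 + ["physics"] * 4

coords = embed_sentences(sentences)
fig = plot_embedding(coords, labels)
```

Fixing `random_state` keeps the layout reproducible across runs, which matters here because t-SNE is a stochastic method.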

## Submission

Please note that the focus of this problem set is primarily on the execution of the tasks, **not** on the final results. The methods chosen are therefore of secondary importance and left to your discretion. However, your results must be **reproducible** with the submitted code. Emphasize clean coding, thorough commenting, and the appropriate use of Git/GitHub. Follow the guidelines laid out in [PEP 8 – Style Guide for Python Code](https://peps.python.org/pep-0008/).

Your solutions should be contained in the Jupyter notebook `problemset.ipynb`, while data should be stored in the `data/` folder. **Everything** required to reproduce your results must be committed to your GitHub Classroom repository. For more details, see the "Problem set" section at: [wbk.ing/MachineLearning/](https://wbk.ing/MachineLearning/).


## Extracting text from PDF files

**(This is a suggestion; remove the code if you do not need it.)**

In [None]:
# Step 1: Install pdfminer.six if you haven't already.
# It is available via conda or pip, see
#   https://anaconda.org/conda-forge/pdfminer.six
#   https://pypi.org/project/pdfminer.six/

# Install pdfminer.six via pip into the environment of the running kernel
%pip install pdfminer.six

In [12]:
# Step 2: Import the required modules
import os
import re
from pdfminer.high_level import extract_text

In [13]:
# Files in the data folder
pdf_files = os.listdir(path='data')
print(pdf_files)

['Bajari-MachineLearningMethods-2015.pdf', 'cesifo1_wp6504.pdf', 'SSRN-id3567724.pdf']


In [14]:
# Path to your PDF file
pdf_file_path = os.path.join('data', pdf_files[2])

# Extract text
extracted_text = extract_text(pdf_file_path)

In [15]:
# Print the first 150 characters of the extracted text
print(extracted_text[:150])

“Let me get back to you” –
A machine learning approach to measuring
non-answers∗

Andreas Barth†

Sasan Mansouri‡

Fabian Woebbeking§

April 1, 2022
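
The single-file extraction above could be extended to every PDF in `data/` along the following lines. This is a sketch under assumptions: the helper names `list_pdf_files` and `extract_all` are hypothetical, and the folder is assumed to contain text-based (not scanned) PDFs. Because `os.listdir` returns entries in arbitrary order and may include non-PDF files, the listing is filtered and sorted so that indices stay reproducible.

```python
import os


def list_pdf_files(folder="data"):
    """Return the PDF file names in `folder`, sorted for a stable order."""
    return sorted(
        name for name in os.listdir(folder) if name.lower().endswith(".pdf")
    )


def extract_all(folder="data"):
    """Map each PDF file name in `folder` to its extracted text."""
    # Imported here so that list_pdf_files works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return {
        name: extract_text(os.path.join(folder, name))
        for name in list_pdf_files(folder)
    }
```

Calling `texts = extract_all("data")` would then give a dictionary keyed by file name, which makes it straightforward to carry the paper label along with each sentence later on.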




## Some pre-processing

In [16]:
# Example: replace every non-alphabetical character with a single space.
# Note that this also removes accented letters and leaves runs of spaces.
processed_text = re.sub(r'[^a-zA-Z]', ' ', extracted_text)

print(processed_text[0:150])

 Let me get back to you    A machine learning approach to measuring non answers   Andreas Barth   Sasan Mansouri   Fabian Woebbeking   April          
