# Data Cleaning

## Introduction

This notebook walks through the steps applied to the collected data in order to produce clean, organized data in two standard text formats. It covers the following steps:

1. **Cleaning the data -** I will use text pre-processing techniques to get the data into shape.
2. **Organizing the data -** I'll organize the data in a way that is easy to feed into other algorithms.

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of texts
2. **Document-Term Matrix** - word counts in matrix format
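
To make the two formats concrete, here is a minimal sketch of how a small corpus could be turned into a document-term matrix. It uses only the standard library and naive whitespace tokenization; the corpus, the names `paper_a` and `paper_b`, and the headlines are made up for illustration and are not part of the collected data.

```python
from collections import Counter

# Hypothetical toy corpus: two made-up headlines, not real collected tweets.
corpus = {
    "paper_a": "congreso debate nueva ley",
    "paper_b": "nueva ley genera debate y mas debate",
}

# Count words per document using naive whitespace tokenization.
counts = {name: Counter(text.lower().split()) for name, text in corpus.items()}

# The vocabulary is the sorted set of every word seen in any document.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary term.
dtm = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has the same length as the vocabulary, so rows from different newspapers can be compared column by column.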

### Problem Statement

My goal is to look at the latest headlines of the main newspapers in Perú and note similarities and differences.

#### Imports

In [1]:
import json
import logging
import os
import sys
from datetime import datetime
from urllib.parse import urlencode

import requests
from dotenv import load_dotenv

#### Configuration options

In [2]:
log_format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

logging.basicConfig(
    stream=sys.stdout,
    format=log_format,
    level=logging.INFO,
)

logger = logging.getLogger()

## Getting the data

The data was taken from the [Twitter API](https://developer.twitter.com), and the newspapers were selected according to a list found in [Diarios de Perú](http://www.diariosdeperu.com.pe) as well as anecdotal experience. The newspapers to investigate:

| Newspaper | Twitter handle |
| ----------- | ----------- |
| El Comercio | elcomercio_peru |
| La República | larepublica_pe |
| Perú 21 | peru21noticias |
| Trome | tromepe |
| Gestión | Gestionpe |
| Diario Correo | diariocorreo |
| Diario Expreso | ExpresoPeru |
| Diario Ojo | diarioojo |
| Diario El Peruano | DiarioElPeruano |
| Diario La Razón | larazon_pe |

In [3]:
load_dotenv()

BASE_DIR = os.environ.get("BASE_DIR")
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")

### Getting the user ids

In [4]:
newspapers = [
    "elcomercio_peru",
    "larepublica_pe",
    "peru21noticias",
    "tromepe",
    "Gestionpe",
    "diariocorreo",
    "ExpresoPeru",
    "diarioojo",
    "DiarioElPeruano",
    "larazon_pe",
    "elbuho_pe",
    "ensustrece"
]
headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}

In [5]:
newspapers_id = {}

for newspaper in newspapers:
    username_url = f"https://api.twitter.com/2/users/by/username/{newspaper}"
    response = requests.get(username_url, headers=headers)

    newspapers_id[newspaper] = response.json()["data"]["id"]

with open(f'{BASE_DIR}/data/raw/newspapers_id.json', 'w') as write_file:
    json.dump(newspapers_id, write_file)
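
The lookup loop indexes `["data"]["id"]` directly, so a single handle that cannot be resolved would stop the whole cell with a `KeyError`. The sketch below shows one defensive way to read the id, assuming a failed lookup returns a payload without a `data` key. The helper `extract_user_id` and both payloads are hypothetical and exist only for illustration.

```python
def extract_user_id(payload):
    """Return the user id from a lookup payload, or None if 'data' is absent."""
    data = payload.get("data")
    if not isinstance(data, dict):
        return None
    return data.get("id")


# Hypothetical payloads standing in for response.json(); not real API output.
found = {"data": {"id": "123", "name": "Example", "username": "example"}}
missing = {"errors": [{"title": "Not Found Error"}]}

ids = {}
for handle, payload in {"example": found, "unknown_handle": missing}.items():
    user_id = extract_user_id(payload)
    if user_id is None:
        # Skip handles that could not be resolved instead of raising.
        continue
    ids[handle] = user_id

print(ids)
```

Skipping unresolved handles keeps the rest of the ids usable; logging the skipped handle would make the gap visible.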

### Getting tweets from newspapers

In [5]:
with open(f'{BASE_DIR}/data/raw/newspapers_id.json', 'r') as read_file:
    newspapers_id = json.load(read_file)

I set a start time and an end time for the tweets so that I can select tweets from specific periods and keep collecting new tweets as the months go by.
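
As a sketch of how such a window could be derived instead of typed by hand, the helper below builds a seven-day window ending at a given date, plus a year-and-week label in the same `{year}w{week}` form that the download cell uses for file names. The function name `weekly_window` and the seven-day length are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def weekly_window(end_date):
    """Return (start_time, end_time, label) for the 7 days ending at end_date.

    end_date is a "YYYY-MM-DD" string; times are formatted as UTC timestamps
    and the label uses the ISO year and ISO week of the end date.
    """
    end = datetime.strptime(end_date, "%Y-%m-%d")
    start = end - timedelta(days=7)
    iso = end.isocalendar()  # (ISO year, ISO week, ISO weekday)
    label = f"{iso[0]}w{iso[1]}"
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt), label


start_time, end_time, label = weekly_window("2022-12-05")
print(start_time, end_time, label)
```

Note that the label is taken from the end date, so a window ending on a Monday is labeled with the week that starts that day.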

In [6]:
for newspaper, newspaper_id in newspapers_id.items():
    query = {
        "max_results": 100,
        "tweet.fields": "id,text,created_at,public_metrics,possibly_sensitive,referenced_tweets",
        "start_time": "2022-11-28T00:00:00Z",
        "end_time": "2022-12-05T00:00:00Z",
    }
    payload = urlencode(query, safe=",:")

    DATA_DIR = f"{BASE_DIR}/data/raw"
    SAVED_DATE = datetime.strptime(query["end_time"][0:10],"%Y-%m-%d")
    
    tweets_url = f"https://api.twitter.com/2/users/{newspaper_id}/tweets"
    response = requests.get(tweets_url, headers=headers, params=payload)

    logger.info(f"{newspaper}: status code {response.status_code}")

    # Parse the body once and reuse it for both keys
    response_json = response.json()
    try:
        response_data = response_json["data"]  # List of tweets
        response_meta = response_json["meta"]
    except KeyError:
        logger.info(f"{newspaper}: No data found!")
        continue

    try:
        next_token = response_meta["next_token"]
        query["pagination_token"] = next_token
        payload = urlencode(query, safe=",:")
    except KeyError:
        logger.info(f"{newspaper}: No MORE data found!")

        with open(f"{DATA_DIR}/{SAVED_DATE.isocalendar().year}w{SAVED_DATE.isocalendar().week}_data_{newspaper}.json", "w") as write_file:
            json.dump({"data": response_data}, write_file)
        logger.info(f"{newspaper}: Saved! FIRST")
        continue

    while True:
        logger.info(f"{newspaper}: New page")
        new_response = requests.get(tweets_url, headers=headers, params=payload)

        try:
            new_json = new_response.json()  # Parse the page once
            response_data += new_json["data"]
            response_meta = new_json["meta"]

            next_token = response_meta["next_token"]
            query["pagination_token"] = next_token
            payload = urlencode(query, safe=",:")
        except KeyError:
            logger.info(f"{newspaper}: No MORE data found!")
            
            with open(f"{DATA_DIR}/{SAVED_DATE.isocalendar().year}w{SAVED_DATE.isocalendar().week}_data_{newspaper}.json", "w") as write_file:
                json.dump({"data": response_data}, write_file)
                logger.info(f"{newspaper}: Saved!")
            break


2022-12-12 11:52:04,982 - root - INFO - elcomercio_peru: status code 200
2022-12-12 11:52:04,982 - root - INFO - elcomercio_peru: New page
2022-12-12 11:52:05,717 - root - INFO - elcomercio_peru: New page
2022-12-12 11:52:06,424 - root - INFO - elcomercio_peru: New page
2022-12-12 11:52:07,385 - root - INFO - elcomercio_peru: New page
2022-12-12 11:52:08,125 - root - INFO - elcomercio_peru: New page
2022-12-12 11:52:08,959 - root - INFO - elcomercio_peru: New page
2022-12-12 11:52:09,796 - root - INFO - elcomercio_peru: New page
2022-12-12 11:52:10,377 - root - INFO - elcomercio_peru: No MORE data found!
2022-12-12 11:52:10,398 - root - INFO - elcomercio_peru: Saved!
2022-12-12 11:52:11,305 - root - INFO - larepublica_pe: status code 200
2022-12-12 11:52:11,310 - root - INFO - larepublica_pe: New page
2022-12-12 11:52:12,170 - root - INFO - larepublica_pe: New page
2022-12-12 11:52:13,047 - root - INFO - larepublica_pe: New page
2022-12-12 11:52:13,947 - root - INFO - larepublica_pe: N