# Project 1: Web scraping and basic summarization
*University of Ljubljana, Faculty of Computer and Information Science* <br />
*Course: Introduction to data science*



The goal of this project is to automatically retrieve structured data from pages on [rtvslo.si](https://www.rtvslo.si).

## Environment setup

To set up the environment, we first need to install conda. Conda can be downloaded from: https://docs.conda.io/projects/conda/en/latest/user-guide/install/download.html . Then we need to create a new environment. I used the name "upv-project_1", but the name is arbitrary. Use the commands below to set up the environment.

`ENVIRONMENT SETUP DESCRIPTIONS:
conda create --name upv-project_1
conda activate upv-project_1
conda install python
conda install selenium
conda install jupyter notebook
ipython kernel install --name "upv-project_1" --user
conda install pandas
conda install requests
conda install numpy`

## Web scraping

We will scrape whole result pages for the search key "koronavirus" on [rtvslo.si](https://www.rtvslo.si), up to the 1000th article, and save the data in JSON format.


JSON Schema:

```
[
  {
    "author": ["author_1", "author_2", ...],
    "day_published": "DD.MM.YYYY",
    "changed_later": "YES"/"NO",
    "title": "article_title",
    "subtitle": "article_subtitle",
    "tags": ["tag_1", "tag_2", ...],
    "section_tag": "section_tag",
    "content": "article_text",
    "comments": [
        {
            "user": "user_name",
            "date_hour": ["DD.MM.YYYY", "HH:MM"],
            "grade": comment_grade (as number),
            "reply": "YES"/"NO",
            "comment": "comment_text"
        },
        ...
    ],
    "hour_published": "HH:MM"
  },
  {
    ...
  }
]
```
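
A record can be checked against this schema before it is written to disk. The block below is only a minimal sketch of such a check: `validate_record` and `REQUIRED_KEYS` are hypothetical names that are not part of helper_f.py, and the example record contains made-up placeholder values.

```python
from datetime import datetime

# Top-level keys taken from the schema above.
REQUIRED_KEYS = {
    "author", "day_published", "changed_later", "title", "subtitle",
    "tags", "section_tag", "content", "comments", "hour_published",
}


def validate_record(record):
    """Return a list of problems found in one article record (hypothetical helper)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if record.get("changed_later") not in ("YES", "NO"):
        problems.append("changed_later must be YES or NO")
    try:
        datetime.strptime(record.get("day_published", ""), "%d.%m.%Y")
    except (ValueError, TypeError):
        problems.append("day_published is not in DD.MM.YYYY format")
    for name in ("author", "tags", "comments"):
        if not isinstance(record.get(name, []), list):
            problems.append(name + " must be a list")
    return problems


# Made-up placeholder record, used only to exercise the check.
example = {
    "author": ["author_1"],
    "day_published": "01.04.2020",
    "changed_later": "NO",
    "title": "article_title",
    "subtitle": "article_subtitle",
    "tags": ["tag_1"],
    "section_tag": "section_tag",
    "content": "article_text",
    "comments": [],
    "hour_published": "12:30",
}
print(validate_record(example))
```

An empty list means that no problem was found. The check covers only the top-level fields, so the comment objects would need a similar pass.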


The block below lists all the libraries and helper functions needed. The helper functions are imported from the helper_f.py file.

In [1]:
# Load all the libraries needed for running the code chunks below

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import os
from selenium.webdriver.common.by import By
import json
import numpy as np
from helper_f import (
    links,
    get_article_data,
    get_article_comments,
    edit_and_join,
    get_data_skit,
    get_data_dostopno,
    get_data_enostavno,
    make_json_format,
    change_day_published,
)




First we need the URLs of all articles. The function "links" simulates typing the search key "koronavirus" into the search bar, navigates through the result pages, collects the URLs of the most recent articles and saves them into the file "article_urls.txt". Because new articles keep being published, each run can return a different set of URLs.
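
The page-by-page collection can be sketched without a browser. In the block below the Selenium part is replaced by a hypothetical callable `fetch_page(page_number)` that returns the article URLs found on one results page. `collect_urls`, `save_urls` and the example.org addresses are also assumptions made for illustration, not the code in helper_f.py.

```python
import os
import tempfile


def collect_urls(fetch_page, limit):
    """Collect unique URLs page by page until at least `limit` are found."""
    urls, seen = [], set()
    page = 0
    while len(urls) < limit:
        added = 0
        for url in fetch_page(page):  # whole pages are always taken
            if url not in seen:
                seen.add(url)
                urls.append(url)
                added += 1
        if added == 0:  # empty page or nothing new: stop
            break
        page += 1
    return urls


def save_urls(urls, path):
    """Write one URL per line, the layout read back in the scraping step."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")


# Made-up result pages standing in for the real search results.
fake_pages = [
    ["https://example.org/a/1", "https://example.org/a/2"],
    ["https://example.org/a/2", "https://example.org/a/3"],
]


def fake_fetch(page):
    return fake_pages[page] if page < len(fake_pages) else []


urls = collect_urls(fake_fetch, limit=3)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "article_urls.txt")
    save_urls(urls, path)
    with open(path, encoding="utf-8") as handle:
        saved = handle.read().splitlines()
print(saved)
```

Since whole pages are taken, the result can hold slightly more than `limit` URLs. The stop condition on a page with nothing new guards against looping forever when the search returns no further results.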

In [2]:
# The call to links() is commented out so the saved URLs are not accidentally overwritten with newer ones.

main_URL = "https://www.rtvslo.si"
num = 1000 # number of articles, max on rtvslo is 1000
main_key = "Koronavirus"


#links(main_URL, main_key, num) # get 1000 article URLs for search key "Koronavirus" and write them into article_urls.txt


Now we can read the URLs one by one from the "article_urls.txt" file. I divided the URLs into 4 categories, based on web page style and features. The majority of pages are in what we will call the normal rtvslo article format (these can also have comments). The others fall into the "SKIT", "ENOSTAVNO" and "DOSTOPNO" categories, each of which has its own source code structure. Pages are scraped depending on category by the functions get_article_data (for normal articles), get_data_skit, get_data_dostopno and get_data_enostavno. Comments from "normal" rtvslo articles are collected with the function get_article_comments. The data from each article is put into JSON form with the make_json_format function and saved into its own .json file in the json folder. Before scraping an article, the script checks whether it was already scraped, so after an interruption it does not need to start from the beginning again.
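
The categorization and file numbering described above can be sketched in isolation. The block below is an illustration only: `categorize`, `number_urls` and the example.org addresses are hypothetical, and the numbering order (normal, SKIT, DOSTOPNO, ENOSTAVNO) is taken from the scraping code that follows.

```python
# URL path markers, checked in the same order as in the scraping cell.
CATEGORY_MARKERS = ("enostavno", "skit", "dostopno")
# Order in which the categories receive their file numbers.
NUMBERING_ORDER = ("normal", "skit", "dostopno", "enostavno")


def categorize(url):
    """Return the category name for one article URL."""
    for marker in CATEGORY_MARKERS:
        if "/{}/".format(marker) in url:
            return marker
    return "normal"


def number_urls(urls):
    """Return (file_number, category, url) triples in numbering order."""
    groups = {name: [] for name in NUMBERING_ORDER}
    for url in urls:
        groups[categorize(url)].append(url)
    ordered = [(name, url) for name in NUMBERING_ORDER for url in groups[name]]
    return [(index, name, url) for index, (name, url) in enumerate(ordered)]


# Made-up URLs; only the path markers matter here.
urls = [
    "https://example.org/enostavno/a",
    "https://example.org/zdravje/b",
    "https://example.org/skit/c",
    "https://example.org/dostopno/d",
    "https://example.org/svet/e",
]
for index, category, url in number_urls(urls):
    print("json/{}.json".format(index), category, url)
```

Because the file number depends on the position of a URL within its category, resuming after an interruption only works as long as article_urls.txt is left unchanged between runs.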

In [3]:
data = []
data_urls = [] #normal rtvslo articles with no SKIT, ENOSTAVNO, DOSTOPNO articles

enostavno = []
skit = []
dostopno = []
print("Started scraping")
os.makedirs('json', exist_ok=True) # make sure the output folder exists

# Read the saved URLs and sort them into the four page categories
with open('article_urls.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
for line in lines: # urls
    url = line.rstrip("\n") # the trailing newline has to be removed
    if "/enostavno/" in url:
        enostavno.append(url)
    elif "/skit/" in url:
        skit.append(url)
    elif "/dostopno/" in url:
        dostopno.append(url)
    else: # normal ones
        data_urls.append(url)

        
u = len(data_urls)
s = len(skit)
d = len(dostopno)
e = len(enostavno)
        
for j in range(u): # normal articles get file numbers 0 .. u-1
    if os.path.isfile('json/{}.json'.format(j)): # article was already scraped, skip it
        continue

    # get article and comments data
    (authors, title, subtitle, date, hour, change, article_tags, section_tag, text) = get_article_data(data_urls[j])
    comments = get_article_comments(data_urls[j])
    line_data = make_json_format(authors, title, subtitle, date, hour, change, article_tags, section_tag, text, comments)

    data.append(line_data)

    # write to a separate .json file
    with open("json/{}.json".format(j), "w", encoding='utf-8') as outfile:
        json.dump(line_data, outfile, ensure_ascii=False)
            
print("END OF MAIN SCRAPING!!!")


for i in range(s): # SKIT articles get file numbers u .. u+s-1
    if os.path.isfile('json/{}.json'.format(u + i)): # article was already scraped, skip it
        continue

    (authors, title, subtitle, date, hour, change, article_tags, section_tag, text, comments) = get_data_skit(skit[i])
    line_data = make_json_format(authors, title, subtitle, date, hour, change, article_tags, section_tag, text, comments)

    data.append(line_data)

    # write to a separate .json file
    with open("json/{}.json".format(u + i), "w", encoding='utf-8') as outfile:
        json.dump(line_data, outfile, ensure_ascii=False)
            
print("END OF SKIT SCRAPING!!!")




for i in range(d): # DOSTOPNO articles follow the SKIT ones
    if os.path.isfile('json/{}.json'.format(u + s + i)): # article was already scraped, skip it
        continue

    (authors, title, subtitle, date, hour, change, article_tags, section_tag, text, comments) = get_data_dostopno(dostopno[i])
    line_data = make_json_format(authors, title, subtitle, date, hour, change, article_tags, section_tag, text, comments)

    data.append(line_data)

    # write to a separate .json file
    with open("json/{}.json".format(u + s + i), "w", encoding='utf-8') as outfile:
        json.dump(line_data, outfile, ensure_ascii=False)
            
print("END OF DOSTOPNO SCRAPING!!!")


for i in range(e): # ENOSTAVNO articles come last
    if os.path.isfile('json/{}.json'.format(u + s + d + i)): # article was already scraped, skip it
        continue

    (authors, title, subtitle, date, hour, change, article_tags, section_tag, text, comments) = get_data_enostavno(enostavno[i])
    line_data = make_json_format(authors, title, subtitle, date, hour, change, article_tags, section_tag, text, comments)

    data.append(line_data)

    # write to a separate .json file
    with open("json/{}.json".format(u + s + d + i), "w", encoding='utf-8') as outfile:
        json.dump(line_data, outfile, ensure_ascii=False)
            
print("END OF ENOSTAVNO SCRAPING!!!")



        
    

Started scraping
END OF MAIN SCRAPING!!!
END OF SKIT SCRAPING!!!
END OF DOSTOPNO SCRAPING!!!
END OF ENOSTAVNO SCRAPING!!!


Scraping all the data takes a long time. I found some typos in the scraped data, but instead of scraping again I decided to fix the data already collected. The function in the block below edits each .json file and saves it into the folder edited_json_files. It also merges all the edited files into one big file called data.json.

In [4]:
edit_and_join() #fix typos and join all single article json files into one big json file with all the data.
print("Json files fixed and joined into data.json file.")

Json files fixed and joined into data.json file.
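
The joining half of this step can be sketched as follows. This is not the code of `edit_and_join`, which lives in helper_f.py and is not shown here: `join_json_files` is a hypothetical helper that assumes the files are named by number, as in the scraping step, and the demo records are made up.

```python
import json
import os
import tempfile


def join_json_files(folder, out_path):
    """Merge every <number>.json file in `folder` into one JSON list."""
    names = [n for n in os.listdir(folder)
             if n.endswith(".json") and n[:-5].isdigit()]
    names.sort(key=lambda n: int(n[:-5]))  # numeric order, so 2 comes before 10
    articles = []
    for name in names:
        with open(os.path.join(folder, name), encoding="utf-8") as infile:
            articles.append(json.load(infile))
    with open(out_path, "w", encoding="utf-8") as outfile:
        json.dump(articles, outfile, ensure_ascii=False)
    return articles


# Demo on a temporary folder with made-up records.
with tempfile.TemporaryDirectory() as folder:
    for number in (0, 1, 2, 10):
        with open(os.path.join(folder, "{}.json".format(number)), "w", encoding="utf-8") as f:
            json.dump({"title": "article_{}".format(number)}, f)
    out_path = os.path.join(folder, "data.json")
    merged = join_json_files(folder, out_path)
    with open(out_path, encoding="utf-8") as f:
        reloaded = json.load(f)
print([article["title"] for article in merged])
```

Sorting by the numeric part of the file name keeps the articles in scraping order. A plain string sort would place 10.json before 2.json.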


## Basic summarization

Prepare and show at least five basic visualizations of the extracted data as presented in the chapter *Summarizing data - the basics* of the course's e-book. Explain each visualization of the data.
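
Each visualization starts from the merged file. The block below is a minimal sketch of loading it and deriving two simple summaries, assuming data.json is a list of records shaped like the JSON schema. `load_articles` and `count_by` are hypothetical helpers, and the sample records are made up so that the block runs on its own.

```python
import json
import os
import tempfile
from collections import Counter


def load_articles(path):
    """Read the merged JSON file into a list of article dictionaries."""
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


def count_by(articles, key):
    """Count how many articles share each value of `key`."""
    return Counter(article[key] for article in articles)


# Made-up sample standing in for data.json.
sample = [
    {"section_tag": "section_a", "comments": [{}, {}]},
    {"section_tag": "section_b", "comments": []},
    {"section_tag": "section_a", "comments": [{}]},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.json")
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(sample, outfile)
    articles = load_articles(path)

section_counts = count_by(articles, "section_tag")
comment_counts = [len(article["comments"]) for article in articles]
print(section_counts.most_common())
print(comment_counts)
```

Counts like these can feed a bar chart or a histogram. With the real file, the same list of dictionaries can also be passed to `pd.DataFrame` for further summarization.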

In [5]:
# Read data from JSON

data = ...

### Visualization 1

`TODO: name the visualization and describe it`

In [6]:
# Visualization 1 code

...

Ellipsis

### Visualization 2

`TODO: name the visualization and describe it`

In [7]:
# Visualization 2 code

...

Ellipsis

### Visualization 3

`TODO: name the visualization and describe it`

In [8]:
# Visualization 3 code

...

Ellipsis

### Visualization 4

`TODO: name the visualization and describe it`

In [9]:
# Visualization 4 code

...

Ellipsis

### Visualization 5

`TODO: name the visualization and describe it`

In [10]:
# Visualization 5 code

...

Ellipsis