<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP (Part 1)

_________________________________


## About the project


This project applies the concepts of web scraping, APIs and Natural Language Processing (NLP).

We will scrape data from Reddit using the Pushshift API, then apply NLP to build a model that classifies which subreddit a given post comes from.
_________________________________

## Problem Statement

Stocks and real estate investing are both popular options for people who want to grow their wealth over time.

Stocks: When people invest in stocks, they're buying a small piece of ownership in a publicly traded company. The goal is to buy stocks at a low price and sell them at a higher price later on, making a profit. However, stocks can be volatile and unpredictable, with prices fluctuating based on factors like company performance, market conditions, and global events.

Real estate investing: This involves buying and managing properties with the goal of earning a profit through rental income, property appreciation, or both. Real estate can be a great long-term investment, but it also requires a significant upfront investment, ongoing maintenance costs, and a knowledge of the local real estate market.


As a data analyst for Kabble Securities, I was tasked with developing a machine learning model that can identify keywords related to stocks and real estate investing based on user inputs. The challenge is to accurately predict which investments are likely to perform well in the future based on the discussions and patterns observed. The ultimate goal is to provide the company's clients with data-driven insights that will help them make informed investment decisions. The success of this project will depend on the model's ability to separate important signals from the noise of online discussions, and on the validity of the underlying assumptions used in the model.



---

## Pushshift API



The data will be scraped from the `realestateinvesting` and `stocks` subreddits.

_______
### Import libraries


In [1]:
import requests
import pandas as pd
import datetime as dt 
import json

import warnings
warnings.filterwarnings("ignore")

---
### Scraping from Reddit


#### Real estate investing posts

In [2]:
# Pushshift submission search endpoint
url = 'https://api.pushshift.io/reddit/search/submission'

# Query parameters: target subreddit and maximum number of posts per request
params = {
    'subreddit': 'realestateinvesting',
    'size': 1000}

for i in range(8):
    # From the second request onwards, only ask for posts older than the last one seen
    if i != 0:
        params['before'] = time_stamp
    res = requests.get(url, params=params)
    print(f'iteration {i}, Status code: {res.status_code}')
    data = res.json()

    # Use the last post's timestamp as the cutoff for the next request
    time_stamp = data['data'][-1]['created_utc']
    print(f'The timestamp is {time_stamp}')
    posts = data['data']
    if i == 0:
        df = pd.DataFrame(posts)
    else:
        df = pd.concat([df, pd.DataFrame(posts)], ignore_index=True, axis=0)

    print(f'The total amount of post is {len(posts)}')

iteration 0, Status code: 200
The timestamp is 1676845888
The total amount of post is 1000
iteration 1, Status code: 200
The timestamp is 1675050775
The total amount of post is 999
iteration 2, Status code: 200
The timestamp is 1673453766
The total amount of post is 999
iteration 3, Status code: 200
The timestamp is 1671652051
The total amount of post is 999
iteration 4, Status code: 200
The timestamp is 1669763120
The total amount of post is 999
iteration 5, Status code: 200
The timestamp is 1667596351
The total amount of post is 999
iteration 6, Status code: 200
The timestamp is 1446028231
The total amount of post is 1000
iteration 7, Status code: 200
The timestamp is 1224871583
The total amount of post is 445
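
The scraping loop above and the one used for the stocks subreddit differ only in the subreddit name and the number of iterations, so the paging logic could be wrapped in a single function. The following is a minimal sketch, not the code that produced the output in this notebook: `fetch_posts`, its `pause` delay and the injectable `get` argument are hypothetical names introduced here for illustration, and the sketch assumes the endpoint keeps returning a JSON body with a `data` list ordered from newest to oldest.

```python
import time

import pandas as pd
import requests

PUSHSHIFT_URL = 'https://api.pushshift.io/reddit/search/submission'


def fetch_posts(subreddit, n_pages, size=1000, pause=1.0, get=requests.get):
    """Page backwards through a subreddit using the `before` timestamp.

    `get` is injectable (an assumption of this sketch) so the paging
    logic can be exercised without calling the real API.
    """
    params = {'subreddit': subreddit, 'size': size}
    frames = []
    for _ in range(n_pages):
        res = get(PUSHSHIFT_URL, params=params)
        res.raise_for_status()          # stop on HTTP errors instead of parsing them
        posts = res.json()['data']
        if not posts:                   # nothing older left to fetch
            break
        frames.append(pd.DataFrame(posts))
        # The last post's timestamp becomes the cutoff for the next request
        params['before'] = posts[-1]['created_utc']
        time.sleep(pause)               # be polite to the API between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With this helper, the two scrapes would reduce to calls such as `fetch_posts('realestateinvesting', n_pages=8)` and `fetch_posts('stocks', n_pages=11)`. Collecting the pages in a list and concatenating once also avoids rebuilding the DataFrame on every iteration.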


In [3]:
# Keep only the columns needed from the dataset
real_estate = df[['subreddit','title','selftext']]
# Remove all duplicate rows (reassign rather than modify a slice in place)
real_estate = real_estate.drop_duplicates()
real_estate.duplicated().sum()

0

#### Stocks posts

In [4]:
# Pushshift submission search endpoint
url = 'https://api.pushshift.io/reddit/search/submission'

# Query parameters: target subreddit and maximum number of posts per request
params = {
    'subreddit': 'stocks',
    'size': 1000}

for i in range(11):
    # From the second request onwards, only ask for posts older than the last one seen
    if i != 0:
        params['before'] = time_stamp
    res = requests.get(url, params=params)
    print(f'iteration {i}, Status code: {res.status_code}')
    data = res.json()

    # Use the last post's timestamp as the cutoff for the next request
    time_stamp = data['data'][-1]['created_utc']
    print(f'The timestamp is {time_stamp}')
    posts = data['data']
    if i == 0:
        df = pd.DataFrame(posts)
    else:
        df = pd.concat([df, pd.DataFrame(posts)], ignore_index=True, axis=0)

    print(f'The total amount of post is {len(posts)}')

iteration 0, Status code: 200
The timestamp is 1677158142
The total amount of post is 999
iteration 1, Status code: 200
The timestamp is 1675883410
The total amount of post is 1000
iteration 2, Status code: 200
The timestamp is 1674857190
The total amount of post is 1000
iteration 3, Status code: 200
The timestamp is 1673657171
The total amount of post is 1000
iteration 4, Status code: 200
The timestamp is 1672525682
The total amount of post is 1000
iteration 5, Status code: 200
The timestamp is 1671416827
The total amount of post is 1000
iteration 6, Status code: 200
The timestamp is 1670349892
The total amount of post is 999
iteration 7, Status code: 200
The timestamp is 1669058521
The total amount of post is 1000
iteration 8, Status code: 200
The timestamp is 1667768890
The total amount of post is 1000
iteration 9, Status code: 200
The timestamp is 1471885386
The total amount of post is 1000
iteration 10, Status code: 200
The timestamp is 1469132192
The total amount of post is 999
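
The timestamps printed above are Unix epoch seconds (`created_utc`), which are hard to read at a glance. As a small sketch, the `datetime` module imported earlier can convert them to UTC dates; `utc_date` is a hypothetical helper name used only for this illustration.

```python
import datetime as dt


def utc_date(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(ts, tz=dt.timezone.utc)


# Cutoff timestamps taken from the stocks output above
for label, ts in [('iteration 0', 1677158142),
                  ('iteration 8', 1667768890),
                  ('iteration 9', 1471885386)]:
    print(label, utc_date(ts).isoformat())

# iteration 0 2023-02-23T13:15:42+00:00
# iteration 8 2022-11-06T21:08:10+00:00
# iteration 9 2016-08-22T17:03:06+00:00
```

Read this way, the cutoff jumps from November 2022 to August 2016 between iterations 8 and 9, which suggests the returned posts are not continuous in time. The notebook does not investigate the cause, so this is worth keeping in mind when interpreting the dataset.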


In [5]:
# Keep only the columns needed from the dataset
stocks = df[['subreddit','title','selftext']]
# Remove all duplicate rows (reassign rather than modify a slice in place)
stocks = stocks.drop_duplicates()
stocks.duplicated().sum()

0

---

### Summary of Scraping from Reddit

In [18]:
print(f'Scraped {len(real_estate)} posts for real estate investing')
print(f'Scraped {len(stocks)} posts for stocks')
print(f'There are {stocks.duplicated().sum()} duplicates for stocks')
print(f'There are {real_estate.duplicated().sum()} duplicates for real estate investing')

Scraped 7287 posts for real estate investing
Scraped 10436 posts for stocks
There are 0 duplicates for stocks
There are 0 duplicates for real estate investing


---
### Exporting Dataset

In [13]:
real_estate.to_csv('./datasets/real_estate.csv')
stocks.to_csv('./datasets/stocks.csv')
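
One detail to be aware of when these files are loaded again: `to_csv` writes the DataFrame index as an unnamed first column by default, which `read_csv` then loads as `Unnamed: 0`. The sketch below illustrates this on a toy frame written to an in-memory buffer; the sample row is made up for the example and is not taken from the scraped data.

```python
import io

import pandas as pd

# Toy frame with the same columns as the exported datasets (made-up row)
sample = pd.DataFrame({'subreddit': ['stocks'],
                       'title': ['example title'],
                       'selftext': ['example text']})

# Default export: the index is written as an extra, unnamed column
buf = io.StringIO()
sample.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))
# ['Unnamed: 0', 'subreddit', 'title', 'selftext']

# Export with index=False: only the data columns are written
buf = io.StringIO()
sample.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)
print(list(without_index.columns))
# ['subreddit', 'title', 'selftext']
```

Either form works, as long as the code that reads the files handles the extra column consistently, for example with `pd.read_csv(path, index_col=0)` when the index was written.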

---