# Project introduction

China is one of many countries facing a low birth rate. Its population decreased by 850,000 in 2022, the first negative growth in over 60 years, despite years of effort to avoid this outcome. China repealed the One-Child policy in 2016 and has launched birth-encouraging policies in recent years. However, the birth rate has not increased as the government anticipated. (https://www.npr.org/2023/01/17/1149453055/china-records-1st-population-fall-in-decades-as-births-drop) <br>

This project examines the attitudes that Chinese people express online toward the government's birth-encouraging policies. Intuitively, it is not hard to imagine that the general opinion may not be very supportive. Still, I am interested in how the degree of dissatisfaction changes over time and which aspects people discuss most when commenting on the topic. To do so, I went through data collection, data cleaning, and text analysis.  <br>

For text analysis, I performed sentiment analysis to measure people’s degree of satisfaction or dissatisfaction from 2021 to 2023, used word clouds to show the most frequent words overall, and used TF-IDF to identify year-specific keywords and interpret the possible reasons behind them.



# Script 1: Data Collection

## Main reference for this script
The code below is adapted from the script at https://zhuanlan.zhihu.com/p/443802888, with many changes because Weibo often changes its anti-crawling algorithm. <br> I also asked my friend Henry Hsieh for help creating this script, and I thank him for his patience and kindness. This is his GitHub: https://github.com/Hsieh-Cheng-Han

## Decide which Weibo comments to scrape

The second step of the analysis is to decide which posts and comments to analyze. For this step, I used a website (link: https://weibo.zhaoyizhe.com/#) that records the most popular Weibo posts (ranked by views within 24 hours). I searched for birth-policy-related words to find the most popular posts and selected the ones with the most comments for web scraping. The URLs I selected are listed below:<br>
1. https://m.weibo.cn/detail/4727660435213482
2. https://m.weibo.cn/detail/4635622721458463
3. https://m.weibo.cn/detail/4643035767642831
4. https://m.weibo.cn/detail/4726596705715427
5. https://m.weibo.cn/detail/4660999765890124
6. https://m.weibo.cn/status/4635609970509108
7. https://m.weibo.cn/detail/4660990257666817
8. https://m.weibo.cn/status/4875327685790516
9. https://m.weibo.cn/detail/4881790165059616
10. https://m.weibo.cn/detail/4734289247207643
11. https://m.weibo.cn/detail/4734283263247434
12. https://m.weibo.cn/detail/4734299180368580<br>
After selecting the posts, I wrote the script below to scrape the comments and save them as Excel files, which made it easy to eyeball the results and to perform preliminary data cleaning and merging in Excel. Then, I converted the Excel files into txt format with the code below and uploaded them to a folder named “Weibo by year” in Jupyter Notebook.
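
The scraping function later in this notebook takes a post ID (`mid`) as input. Assuming the ID is the trailing number in each URL listed above (an assumption based on the URL shapes, not something Weibo documents here), a minimal sketch for pulling the IDs out could look like this. The helper name `extract_mid` is hypothetical and is not part of the original script.

```python
import re

# Hypothetical helper: pull the numeric post ID out of an m.weibo.cn URL
# of the form .../detail/<id> or .../status/<id>.
MID_PATTERN = re.compile(r"m\.weibo\.cn/(?:detail|status)/(\d+)")


def extract_mid(url: str) -> str:
    """Return the numeric post ID found in a mobile Weibo URL."""
    match = MID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No post ID found in URL: {url}")
    return match.group(1)


# Two of the URLs from the list above, one of each shape.
urls = [
    "https://m.weibo.cn/detail/4727660435213482",
    "https://m.weibo.cn/status/4635609970509108",
]
mids = [extract_mid(u) for u in urls]
print(mids)
```

Each extracted ID could then be typed in (or passed directly) when running the scraper, instead of copying the number by hand.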

# Web scraping framework

Unfortunately, the main data sources I wished to use for the analysis do not provide an API for researchers, so I had to scrape the data. Below is a simple framework that outlines the main steps of web scraping:
1. Identify the data to be scraped: The first step is to decide what data you want to obtain. For this project, I selected two candidate sources and extracted the text only. 
2. Select a web scraping tool: The second step is to choose a web scraping tool. In class, we used BeautifulSoup as an example. However, due to the website's structure, I used regular expressions (Python's `re` module) to extract the content. 
3. Identify the website's structure: The third step is to investigate the website structure and source data to identify where the content I want is located. I did this with Chrome’s developer tools, which let me view the HTML and CSS. 
4. Write a web scraping script: The fourth step is to create a web scraping script based on my understanding from step 3. We did something similar in class with the Met website, using requests, parsing, etc. 
5. Clean and store the data: Clean and process the extracted data to make it usable for analysis. Store the data in a structured format, such as a CSV file or a database, for easy retrieval and analysis.
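
To make steps 3 to 5 concrete, here is a small sketch that runs on a toy HTML string rather than a real page. The HTML, the `comment` class name, and the variable names are all made up for illustration; the real Weibo responses are JSON and are handled differently in the script further down.

```python
import csv
import io
import re

# Toy HTML standing in for a fetched page; the "comment" class is invented.
html = """
<div class="comment"><span>First comment</span></div>
<div class="comment"><span>Second <b>comment</b></span></div>
"""

# Steps 3 and 4: locate the content with a pattern, then strip leftover tags.
comment_pattern = re.compile(r'<div class="comment">(.*?)</div>', re.S)
tag_pattern = re.compile(r"<[^>]+>")
comments = [
    tag_pattern.sub("", raw).strip() for raw in comment_pattern.findall(html)
]

# Step 5: store the cleaned text in a structured format (CSV text here).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["comment"])
writer.writerows([c] for c in comments)
print(buffer.getvalue())
```

Regular expressions work here because the target text sits in a predictable spot; for deeply nested or irregular HTML, a parser such as BeautifulSoup is usually the safer choice.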

# Determine the data source: Weibo vs Zhihu

To obtain the data that best represents public opinion, I compared Weibo (China’s Twitter) and Zhihu (China’s Quora) and eventually used Weibo. The main difference between the two is that Weibo has the advantage in the number of comments, while Zhihu has the advantage in the quality of responses. In other words, Weibo may contain more comments from more people, but they tend to be short. Zhihu, by contrast, is likely to have longer discussions with keywords such as “not practical” and “should do XX instead of YYY.”<br>
As I was unsure which data source would give a better result, I scraped both, ran tokenization and sentiment analysis on each to compare, and decided to proceed with Weibo because it has more data and a more balanced amount across the three years. I describe only the process of selecting which Weibo posts to scrape. If you are interested in the process of scraping Zhihu and the data quality check, please refer to the appendix.

# Step 1: Import libraries

In [None]:
import re
import time
import requests
from openpyxl import workbook

# Step 2: Create a function to scrape Weibo comments

Weibo is known to be hard to scrape because its data is extremely valuable. I battled with the website for a couple of days before realizing the main trick: use the mobile URL rather than the web URL. The mobile site is not as well protected as the web site, so all I need to do is provide my login info. However, scraping through the mobile URL also has drawbacks: it is much slower and often breaks down. I did my best to collect as many comments as my computer and time allowed.
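
Because the requests often break down, one common mitigation is to retry a failed request after a pause. The sketch below is not part of the original script; `fetch_with_retry` and its parameters are hypothetical names, and the flaky function only simulates a dropped connection so the example runs without touching Weibo.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The wait doubles after each failure (1s, 2s, 4s, ... by default).
    Catching Exception is deliberately broad for this sketch; real code
    would usually catch only network and JSON-decoding errors.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Simulated flaky request: fails twice, then returns a payload.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated drop")
    return {"ok": 1}


# sleep is replaced with a no-op so the demo finishes instantly.
result = fetch_with_retry(flaky, sleep=lambda seconds: None)
print(result)
```

In the scraper, `fetch` could be a small lambda wrapping the `requests.get(...).json()` call, so that a single failed page does not end the whole run.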

In [1]:
def weibo(mid):
    url1 = "https://m.weibo.cn/comments/hotflow"
    # root url
    params1 = {
        "id": mid,
        "mid": mid,
        "max_id_type": "0"
    }
    # set the request parameters
    headers1 = {
        'referer': 'https://m.weibo.cn/detail/4727660435213482?sudaref=m.weibo.cn',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
    }
    # Weibo requires a login to view comments; my login cookie is set in `headers` below
    r1 = requests.get(url1, headers=headers1, params=params1).json()
    # take the first number in the `analysis_extra` field of the first comment; it is used as `uid` below
    uid = r1['data']['data'][0]['analysis_extra']
    uid = re.findall(r'\d+', uid)[0]
    max_id = '0'
    url = "https://weibo.com/ajax/statuses/buildComments"
    headers = {
        'cookie': 'SINAGLOBAL=8515648282164.871.1666852129080; UOR=,,www.baidu.com; ULV=1680064186885:14:13:6:260922579667.9904.1680064186858:1679991894944; XSRF-TOKEN=4WYJ2HnYjeNQurJxrds3VP9x; SCF=AkTKZXmPlQFbWGbhqaBkHERup0-oJgr3GUuV1YeajJn_MdKUZyckI6tPmN76Eb7GeK_RWSzxzEHXyQ8PIKyLtVE.; SUB=_2A25JJ6J4DeRhGeFJ41sS9C7Ozz2IHXVqVJSwrDV8PUNbmtANLWH7kW9NfvbT3JC2MFhzX8uTCOBK8zDuqDtHtsMu; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5C6rxAfHsjpOHRQ6ImUVsu5JpX5KzhUgL.FoMN1h.0Sh5ESh22dJLoIERLxKnLBK2L12eLxKnL1h5L1h-LxK-L1-BLBKB_i--fi-82iK.7i--NiKyFiKnE; ALF=1711605160; SSOLoginState=1680069160; WBPSESS=Dt2hbAUaXfkVprjyrAZT_K7wd_BdTIfuEX6in29Oo1z_Mg08JjzTLAekKHQBMUQjbpWPgTbzu_3khRhOd61INlydGZoCGZaIhQRhJm5Nq1cTi4-A6ATPx9gnCqmBgcxQ8BRVeRT3jq_7KD9XgSNb7msAfU_BhIxkaNevuqiRcRtBvrZEcYur_HSDVKiKj_pRCCTdhVmz0afQw-tgFVKg6g==',
        'referer': 'https://weibo.com/1642512402/MyjJCleeI?refer_flag=1001030103_',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',

    }
    # details of my Weibo login; I used a Windows computer for this because I did not want to log in to Weibo on my Mac
    dic = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
           'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
    p = 1
    while 1:
        li.append(max_id)
        params = {
            "flow": "0",
            "is_reload": "1",
            "id": mid,
            "is_show_bulletin": "2",
            "is_mix": "0",
            "max_id": max_id,
            "count": "20",
            "uid": uid,
            "fetch_level": "0"
        }
        if p == 1:
            params = {
                "is_reload": "1",
                "id": mid,
                "is_show_bulletin": "2",
                "is_mix": "0",
                "count": "10",
                "uid": uid,
                "fetch_level": "0"
            }
        r = requests.get(url, headers=headers, params=params).json()
        # print(r)
        max_id = r['max_id']
        # print(max_id)
        if max_id in li:
            break
        data = r['data']
        for i in data:
            text = i['text']
            pattern1 = re.compile(r'<[^>]+>')  # match html tags
            pattern2 = re.compile(r'[\s+\/_,$%^*(+]+|[+——￥%……&]+')  # match special symbols
            text = pattern1.sub('', text)  # remove html tags
            text = pattern2.sub('', text)  # remove special symbols
            times = i['created_at'].split(' ')
            date = times[-1] + '-' + dic[f'{times[1]}'] + '-' + times[2]
            ws.append([text, date])
            print(text)
            print(date)
        p += 1
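
The function above builds the date by splitting `created_at` on spaces and looking up the month in `dic`. Assuming the timestamp has the shape `Tue Mar 28 15:30:00 +0800 2023` (an assumption inferred from how the fields are indexed, and the sample value below is made up), the same `YYYY-MM-DD` string can be produced with `datetime.strptime`. The helper name `parse_created_at` is hypothetical.

```python
from datetime import datetime


def parse_created_at(created_at: str) -> str:
    """Convert a timestamp like 'Tue Mar 28 15:30:00 +0800 2023' to 'YYYY-MM-DD'.

    %a and %b depend on the locale; this assumes English day and month names.
    """
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return parsed.strftime("%Y-%m-%d")


# Made-up sample value in the assumed format.
print(parse_created_at("Tue Mar 28 15:30:00 +0800 2023"))  # 2023-03-28
```

Unlike the manual split, `strptime` raises a `ValueError` when the timestamp does not match the expected format, which makes silent date errors less likely.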



# Step 3: Run the functions

In [None]:

if __name__ == '__main__':
    wb = workbook.Workbook()  # create an Excel workbook
    ws = wb.active  # get the active worksheet, which weibo() appends rows to
    ws.append(['评论', '日期'])  # header row: comment, date
    li = []  # max_id values already requested; weibo() uses it to detect the last page
    mid = input('Enter the Weibo post ID (mid): ')
    weibo(mid)
    wb.save(f'微博{mid}.xlsx')  # saved as "微博<mid>.xlsx" (微博 = Weibo)

# Step 4: Manually check the Excel sheets and aggregate into by-year files

For this step, I opened all the Excel files created above to take a quick look at the data and manually categorized the comments by year. After combining them into by-year files, I used the code in step 5 to convert them into txt for ease of text analysis.
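
I did this grouping by hand, but it could also be scripted. The sketch below is an alternative to my manual process, not what I actually ran: it assumes each row is a `(text, date)` pair with the date in `YYYY-MM-DD` form, as written by the scraper, and the sample rows are made up.

```python
from collections import defaultdict


def group_by_year(rows):
    """Group (text, date) rows by the year prefix of a YYYY-MM-DD date string."""
    by_year = defaultdict(list)
    for text, date in rows:
        by_year[date[:4]].append(text)
    return dict(by_year)


# Made-up rows standing in for scraped comments.
rows = [
    ("comment a", "2021-06-01"),
    ("comment b", "2022-01-18"),
    ("comment c", "2021-12-30"),
]
grouped = group_by_year(rows)
for year in sorted(grouped):
    print(year, len(grouped[year]))
```

A manual look at the sheets is still useful for spotting spam or off-topic comments, which a simple grouping like this will not catch.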

# Step 5: Converting xlsx to txt

In [None]:
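# generic process for converting an xlsx workbook to txt files for ease of analysis
import pandas as pd
xl = pd.ExcelFile('combined_weibo_comment.xlsx')

# write each sheet to its own txt file, named after the sheet;
# the first row of each sheet is read as a header and is not written out
for sheet in xl.sheet_names:
    file = pd.read_excel(xl, sheet_name = sheet)
    file.to_csv(sheet+'.txt', header = False, index = False)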