Python scripts to analyze articles from exxpress with basic stylometric features, possibly identifying stylistic changes over time (e.g., from generative AI use).
- Retrieve all articles from exxpress.at via the WordPress REST API.
- Count articles per year.
- Extract and analyze all "Native Ad" articles.
- Perform stylometric analysis per category and month (e.g., average word count, sentence count, sentence length, and lexical diversity).
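The retrieval step in the list above amounts to paging through the WordPress posts endpoint. The following is a minimal sketch under stated assumptions, not the code in crawl.py: it assumes the site exposes the standard `/wp-json/wp/v2/posts` route, uses the standard library's `urllib` instead of `requests` so it has no dependencies, and the helper names `fetch_page` and `iter_posts` are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterator, List, Tuple

# One page of results: (posts on this page, total number of pages).
Page = Tuple[List[dict], int]


def fetch_page(base_url: str, page: int, per_page: int = 100) -> Page:
    """Fetch one page of posts from a WordPress site (performs a network request).

    Assumes the standard WordPress REST route; per_page is capped at 100
    by WordPress, and the page count is reported in X-WP-TotalPages.
    """
    url = f"{base_url}/wp-json/wp/v2/posts?per_page={per_page}&page={page}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total_pages


def iter_posts(get_page: Callable[[int], Page]) -> Iterator[dict]:
    """Yield every post by requesting pages 1..total_pages in order."""
    page, total_pages = 1, 1
    while page <= total_pages:
        posts, total_pages = get_page(page)
        yield from posts
        page += 1
```

A call such as `list(iter_posts(lambda p: fetch_page("https://exxpress.at", p)))` would then collect all posts, assuming that base URL serves the API. Passing the page fetcher in as a function keeps the pagination loop testable without network access.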
You'll need Python (version 3.7 or higher recommended). Install dependencies within a virtual environment (venv) for easy management.
First, download or clone the repository to your computer.

    git clone https://github.com/yourusername/express.git
    cd express

Set up a Python virtual environment (venv) to manage dependencies.

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

With the virtual environment activated, install the necessary packages.

    pip install -r requirements.txt

Or manually install the packages:

    pip install requests nltk pandas matplotlib

Some NLTK resources are required for tokenization and stopword lists. Run the command below to ensure all resources are downloaded.

    python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('punkt_tab')"

Run the crawl.py script to download all articles up to the current date from the exxpress API. This saves a JSON file named express.json with the downloaded articles.

    python3 crawl.py

Run the count.py script to output the number of articles published each year.

    python3 count.py

To extract all articles tagged as "Native Ad" into a separate JSON file, run nativead.py. This will create a file called native-ad.json with only Native Ad articles.

    python3 nativead.py

Run the analyze.py script to analyze each article's stylometric features per category per month. This outputs:
- category_monthly_stats.xlsx: An Excel file with monthly statistics for each category.
- Line charts for each stylometric feature, saved as .png files.

    python3 analyze.py

Together, the scripts produce the following files:
- category_monthly_stats.xlsx: Monthly statistics for each category, with features like average word count, sentence count, sentence length, and lexical diversity.
- Feature Plots: PNG images, each plotting a stylometric feature over time for different categories.
- native-ad.json: A JSON file containing only Native Ad articles.
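To make the monthly statistics more concrete, here is a rough sketch of how the four features could be computed and averaged per category and month. It is an illustration under assumptions rather than the logic in analyze.py: it uses regular expressions instead of NLTK tokenizers and plain dictionaries instead of pandas, it takes lexical diversity to mean the type-token ratio (unique words divided by total words), and the record keys `category`, `date`, and `text` are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Simplified tokenization (an assumption; NLTK would handle edge cases better).
WORD_RE = re.compile(r"\w+")
SENTENCE_END_RE = re.compile(r"[.!?]+(?:\s+|$)")


def features(text):
    """Compute basic stylometric features for one article text."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    n_words, n_sents = len(words), len(sentences)
    return {
        "word_count": n_words,
        "sentence_count": n_sents,
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
        # Type-token ratio: share of distinct words among all words.
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
    }


def monthly_stats(articles):
    """Average each feature over all articles sharing (category, month).

    Each article is assumed to be a dict with hypothetical keys
    "category", "date" (ISO-style string starting with YYYY-MM), and "text".
    """
    groups = defaultdict(list)
    for article in articles:
        month = article["date"][:7]  # "YYYY-MM"
        groups[(article["category"], month)].append(features(article["text"]))
    return {
        key: {name: mean(f[name] for f in feats) for name in feats[0]}
        for key, feats in groups.items()
    }
```

Each key of the returned dictionary is a `(category, month)` pair, which corresponds to one row of a per-category monthly table.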
This project is licensed under the MIT License.