Newsfeed-Automation

This is a demo that uses docker-compose to build a newsfeed automation website. It is implemented with Python Flask (web server) and MSSQL (database).

Folder structure

Newsfeed
├── README.md
├── src
│   ├── data_challenges.py
│   ├── data_challenges.ipynb
│   ├── dewikinews-20190720-pages-articles-multistream.xml
│   ├── requirements.txt
│   ├── unigram_df_example.csv
│   └── bigram_df_example.csv
└── example
    ├── unigram_df(top100_example).csv
    ├── bigram_df(top100_example).csv
    ├── top50_bigram.png
    └── top100_bigram.png

Challenges

Download the German Wikinews XML dump: https://dumps.wikimedia.org/dewikinews/20190720/dewikinews-20190720-pages-articles-multistream.xml.bz2
Extract the text from every page (XML path: page > text element)
Remove markdown and formatting
Try to preserve the textual description of links (the description, not the link target)
Remove punctuation
Remove German stopwords
Replace German umlauts (ö -> oe, ü -> ue, ä -> ae, Ö -> Oe, Ü -> Ue, Ä -> Ae, ß -> ss)
Combine bigrams with more than 5 occurrences
Plot the 10 most frequent words (x-axis words, y-axis frequency)
Plot the 10 most frequently occurring bigrams (a new count, not the bigrams from the previous step)
Plot the 10 most frequent words at the beginning of a sentence
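
The cleaning and counting steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in data_challenges.py. The names STOPWORDS, strip_links, clean_tokens, bigram_counts and combine_bigrams are hypothetical, the stopword set is a tiny stand-in for a full German list, and the link pattern handles only the simple [[target|description]] form.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword set for illustration only; a real run would
# use a complete German stopword list.
STOPWORDS = {"der", "die", "das", "und", "in", "ist", "ein", "eine", "für"}

# Umlaut replacements exactly as listed in the challenge.
UMLAUTS = str.maketrans({
    "ö": "oe", "ü": "ue", "ä": "ae",
    "Ö": "Oe", "Ü": "Ue", "Ä": "Ae", "ß": "ss",
})

# Simplified link pattern: [[target|description]] or [[target]].
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def strip_links(text):
    """Keep the visible description of a link and drop the target."""
    return LINK.sub(r"\1", text)


def clean_tokens(text):
    """Strip links, remove punctuation and stopwords, then replace umlauts."""
    text = strip_links(text)
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes whitespace
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return [t.translate(UMLAUTS) for t in tokens]


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def combine_bigrams(tokens, min_count=5):
    """Join pairs that occur more than min_count times into one token."""
    frequent = {bg for bg, n in bigram_counts(tokens).items() if n > min_count}
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Stopwords are removed before the umlaut replacement so that entries such as "für" still match the list. The ten most frequent words or bigrams can then be read from Counter(tokens).most_common(10) or bigram_counts(tokens).most_common(10) and passed to any plotting library.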

Note

To save time, it is possible to process only a limited number of pages from the XML, set via:
- random_pages = 100 (in the code)
The unigram and bigram counts are saved as CSV files during processing; see the example folder for sample output.
The bigram network graph for page indices 1 to 100 is also included in the example directory.
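
One way to limit the number of pages and save counts as CSV is sketched below. It is an illustration under stated assumptions, not the repository's implementation: iter_page_texts, write_counts_csv, the max_pages parameter and the term/count column names are hypothetical, and the sketch takes the first pages of the dump rather than a random sample.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter


def iter_page_texts(source, max_pages=100):
    """Yield the <text> content of at most max_pages pages from a dump.

    source is a file name or a binary file object; max_pages plays the
    role of random_pages in the note above (assumed: first pages only).
    """
    seen = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "text":
            yield elem.text or ""
        elif tag == "page":
            elem.clear()  # release memory held by the finished page
            seen += 1
            if seen >= max_pages:
                break


def write_counts_csv(counts, fileobj):
    """Write a Counter to CSV, most frequent first (hypothetical columns)."""
    writer = csv.writer(fileobj)
    writer.writerow(["term", "count"])
    for term, n in counts.most_common():
        writer.writerow([term, n])
```

Because the dump is parsed as a stream, stopping after max_pages avoids reading the rest of the file. To read the compressed download directly, a file object from bz2.open(path, "rb") can be passed as source.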

After visualizing the data, we can see that it contains a lot of date and category information. This comes from the descriptions in the markdown links. In the data cleaning stage, we removed only JavaScript, CSS, several HTML tags, markdown syntax, and URLs, but kept the link descriptions. As a result, many technical terms remain.

cwl286/Data_challenges

Folders and files

Latest commit

History

Repository files navigation

Newsfeed-Automation

Folder structure

Challenges

Note

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages