cwl286/Data_challenges

Newsfeed-Automation

Python

This is a demo that uses docker-compose to build a newsfeed automation website. It is implemented with Python Flask (web server) and MSSQL (database).

Folder structure

Newsfeed

├──README.md
├──src
│    ├──data_challenges.py
│    ├──data_challenges.ipynb
│    ├──dewikinews-20190720-pages-articles-multistream.xml
│    ├──requirements.txt
│    ├──unigram_df_example.csv
│    └──bigram_df_example.csv
└──example
     ├──unigram_df(top100_example).csv
     ├──bigram_df(top100_example).csv
     ├──top50_bigram.png
     └──top100_bigram.png

Challenges

  • Download the German Wikinews XML dump: https://dumps.wikimedia.org/dewikinews/20190720/dewikinews-20190720-pages-articles-multistream.xml.bz2
  • Extract the text from every page (XML page > text element)
  • Remove markdown and formatting
    (Try to preserve the textual description of links, that is, the description rather than the link target)
  • Remove punctuation
  • Remove German stopwords
  • Replace German umlauts and ß (ö -> oe, ü -> ue, ä -> ae, Ö -> Oe, Ü -> Ue, Ä -> Ae, ß -> ss)
  • Combine bigrams with more than 5 occurrences
  • Plot the 10 most frequent words (x-axis words, y-axis frequency)
  • Plot the 10 most frequently occurring bigrams (a new count, not the bigrams from the previous step)
  • Plot the 10 most frequent words at the beginning of a sentence
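The text-processing steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in data_challenges.py: the function names, the tiny stopword set, and the choice to join merged bigrams with an underscore are all assumptions made for the example. Stopwords are filtered before the umlaut replacement so that a stopword list containing words like "für" would still match.

```python
import re
from collections import Counter

# Replacements exactly as listed in the challenge.
UMLAUT_MAP = str.maketrans({
    "ö": "oe", "ü": "ue", "ä": "ae",
    "Ö": "Oe", "Ü": "Ue", "Ä": "Ae", "ß": "ss",
})

# Tiny illustrative stopword set (assumption); a real run needs a full German list.
STOPWORDS = {"der", "die", "das", "und", "ist", "in", "ein", "eine"}

PUNCT_RE = re.compile(r"[^\w\s]")          # anything that is not a word char or space
SENTENCE_END_RE = re.compile(r"[.!?]+\s+")  # naive sentence boundary


def tokenize(text):
    """Remove punctuation, drop stopwords, then replace umlauts."""
    words = PUNCT_RE.sub(" ", text).split()
    kept = [w for w in words if w.lower() not in STOPWORDS]
    return [w.translate(UMLAUT_MAP) for w in kept]


def count_ngrams(tokens, n):
    """Count n-grams as tuples of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def merge_frequent_bigrams(tokens, min_count=5):
    """Join bigrams seen more than min_count times into one token (greedy, left to right)."""
    counts = Counter(zip(tokens, tokens[1:]))
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and counts[pair] > min_count:
            merged.append("_".join(pair))  # underscore joiner is an assumption
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def sentence_start_counts(text):
    """Count the first word of each sentence (stopwords are kept here)."""
    starts = Counter()
    for sentence in SENTENCE_END_RE.split(text):
        words = PUNCT_RE.sub(" ", sentence).split()
        if words:
            starts[words[0].translate(UMLAUT_MAP)] += 1
    return starts
```

With these helpers, `count_ngrams(tokens, 1).most_common(10)` and `count_ngrams(tokens, 2).most_common(10)` would give the data for the word and bigram plots, and `sentence_start_counts(text).most_common(10)` the data for the sentence-start plot.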

Note

  • To save time, processing can be limited to a subset of the pages in the XML.
    • random_pages = 100 (in the code)
  • The unigram and bigram counts are saved as CSV files during processing. Examples can be found in the example folder.
  • The bigram network graph for page indices 1 - 100 is also included in the example directory.
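Limiting the number of pages works well with a streaming parser, since the dump never has to be loaded in full. The sketch below is an assumption about how this could be done, not the repository's implementation: `iter_page_texts` and `first_page_texts` are hypothetical helpers, and because the way the code uses `random_pages` is not shown here, the sketch simply takes the first N pages.

```python
import xml.etree.ElementTree as ET
from itertools import islice


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_page_texts(source):
    """Yield the text of each <page> from a file path or binary file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "page":
            text = ""
            for child in elem.iter():
                if local_name(child.tag) == "text":
                    text = child.text or ""
            yield text
            elem.clear()  # free the memory held by the finished page


def first_page_texts(source, limit):
    """Return the text of at most `limit` pages (hypothetical helper)."""
    return list(islice(iter_page_texts(source), limit))
```

The compressed dump could be read directly by passing a file object, for example `first_page_texts(bz2.open(path, "rb"), 100)` after `import bz2`, where `path` points to the downloaded .xml.bz2 file.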

After visualizing the data, we can see a lot of date and category information. This comes from the descriptions in the markdown links: in the data cleaning stage, we removed only JavaScript, CSS, several HTML tags, markdown syntax and URLs, but kept the link descriptions. As a result, many technical terms remain in the data.
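One way to keep link descriptions while reducing the category noise is sketched below. It assumes the links use MediaWiki wikitext syntax (`[[target|description]]` and `[http://url description]`) and that category links start with `Kategorie:` or `Category:`; the regular expressions and the `strip_links` helper are illustrative assumptions, not the cleaning code used in this repository.

```python
import re

# [[Kategorie:...]] or [[Category:...]] links (assumed category prefixes).
CATEGORY_RE = re.compile(r"\[\[(?:Kategorie|Category):[^\]]*\]\]", re.IGNORECASE)
# [[target|description]] or [[target]]; group 1 is the visible text.
INTERNAL_LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")
# [http://url description]; group 1 is the description (may be empty).
EXTERNAL_LINK_RE = re.compile(r"\[https?://[^\s\]]+\s*([^\]]*)\]")
# URLs that appear outside of any brackets.
BARE_URL_RE = re.compile(r"https?://\S+")


def strip_links(wikitext, drop_categories=True):
    """Replace links by their description; optionally drop category links entirely."""
    if drop_categories:
        wikitext = CATEGORY_RE.sub("", wikitext)
    wikitext = INTERNAL_LINK_RE.sub(r"\1", wikitext)
    wikitext = EXTERNAL_LINK_RE.sub(r"\1", wikitext)
    return BARE_URL_RE.sub("", wikitext)
```

Calling `strip_links(text, drop_categories=False)` keeps the category names as plain text, which matches the behavior described above where category information shows up among the most frequent terms.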
