
Persian raw text - about 80 GB of raw Persian text


persiannlp/persian-raw-text


Persian Raw Text - متن خام فارسی

The package contains a huge amount of Persian text, collected from the following sources:

Each resource is modified to exclude non-text content (URLs, HTML, non-UTF-8 content, etc.). I have also dropped the lines that do not contain any Persian text. I have not done any deduplication, so there might be repeated content.
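Since the data is not deduplicated, you may want to filter it yourself. Below is a minimal sketch of one way to do that, not the script used to build this dataset. It assumes that "contains Persian text" can be approximated by "contains at least one character from the Unicode Arabic block (U+0600 to U+06FF)", and that duplicates are exact matches after stripping whitespace. The names `has_persian` and `dedup_persian_lines` are hypothetical.

```python
import hashlib
import re

# Assumption: Persian text is detected by the presence of any character in the
# Unicode Arabic block, which the Persian script uses. This also matches Arabic.
PERSIAN_CHAR = re.compile(r"[\u0600-\u06FF]")


def has_persian(line: str) -> bool:
    """Return True if the line has at least one Arabic-block character."""
    return PERSIAN_CHAR.search(line) is not None


def dedup_persian_lines(lines):
    """Yield each distinct line that contains Persian text, in first-seen order."""
    seen = set()
    for line in lines:
        text = line.strip()
        if not text or not has_persian(text):
            continue
        # Store a fixed-size digest instead of the full line to bound memory.
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Because the function accepts any iterable of strings, you can pass it an open file object and the corpus is read lazily, one line at a time.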

The overall data is here (~70 GB, ~13.5 million paragraphs).

Note: since the files are relatively large, you probably shouldn't download them in your browser. A good way to download the files is to use gsutil (see here for more). It reports the total download size, download progress, etc.:

gsutil -m cp -R gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt  .
Copying gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt...
/ [0/1 files][600.2 MiB/ 69.8 GiB]   0% Done 

You can also use tools like wget:

$ wget https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt 
--2020-05-17 14:53:08--  https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68720495550 (64G) [text/plain]
Saving to: ‘commoncrawl_fa_merged.txt.1’

commoncrawl_fa_merged.txt.1                    0%[                                                                                              ] 542.30M  55.9MB/s    eta 17m 44s
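With files this large, an interrupted transfer is costly. Here is a rough sketch of a resumable download using only the Python standard library. It assumes the server honours HTTP `Range` requests, which is not confirmed here for this bucket. The function names `build_resume_request` and `download` are hypothetical.

```python
import os
import shutil
import urllib.request


def build_resume_request(url: str, dest: str) -> urllib.request.Request:
    """Build a GET request asking only for the bytes not yet present in dest."""
    request = urllib.request.Request(url)
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    if offset:
        # Ask the server to start at the first byte we do not have yet.
        request.add_header("Range", f"bytes={offset}-")
    return request


def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, appending to a partial file when possible."""
    request = build_resume_request(url, dest)
    # Note: if dest is already complete, the server may answer 416 and
    # urlopen raises urllib.error.HTTPError; this sketch does not handle that.
    with urllib.request.urlopen(request) as response:
        # 206 means the Range header was honoured; otherwise start from scratch.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            shutil.copyfileobj(response, out, chunk_size)
```

To use it, call `download` with the URL from the wget example above and a local file name. Re-running the same call after an interruption continues from the partial file.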

Credits

If you find this repo useful, please include a reference to this repository.
