Commands to execute python files? #12
So far, there's only the list of examples in the README, each with a short description of which data is extracted. In addition, every example will show a command-line help if called with option What exactly do you need?
Let me know what you need! You may also ask for help and support on the Common Crawl forum.
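As an illustration of the command-line help mentioned above, here is a minimal sketch, assuming the example scripts build their command line with Python's standard `argparse` module. The script name `example_job.py` and the arguments `input`, `output` and `--limit` are hypothetical and are not taken from the project; only the automatically added `-h`/`--help` option is standard `argparse` behaviour.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical example job (names are illustrative)."""
    parser = argparse.ArgumentParser(
        prog="example_job.py",  # hypothetical script name
        description="Hypothetical example job: count words in the input.",
    )
    # Positional arguments, invented for this sketch only.
    parser.add_argument("input", help="path to a file listing the input paths")
    parser.add_argument("output", help="name of the output table")
    # An optional argument, also invented for this sketch.
    parser.add_argument("--limit", type=int, default=10,
                        help="maximum number of inputs to process")
    return parser


# argparse adds -h/--help on its own; format_help() returns the same text
# that would be printed when the script is called with that option.
help_text = build_parser().format_help()
print(help_text)
```

Reading the help text first should show which positional arguments a script expects, without having to read its source line by line.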
@sebastian-nagel Thank you for the reply. The first one you mentioned is what I imagined when I wrote the issue. Here is my story of struggle, and it is still going on. You may skip reading this part. So, this is the current bash script:
And I'm getting org.apache.http.conn.ConnectionPoolTimeoutException. I tried to limit the number of executors (somebody on the internet suggested it), but it doesn't work as I expected. The exception happens at the line "df = spark.read.load(table_path)" of sparkcc.py. Thank you for reading!
Hi @calee88, thanks for the careful report. I've opened #13 and #14 to improve documentation and command-line help. When querying the columnar index (
Let me know whether this works for you!
Thank you for the reply @sebastian-nagel! I'm using a reliable and fast internet connection, although I'm far from Northern Virginia. I think the connection should not be the problem here. Have you tried to access the data remotely using the script I posted? Were you successful?
Hello @sebastian-nagel.
Thanks, @calee88, for the feedback. #13 will get addressed soon. Yes, I'm able to run the script
df = sqlContext.read.parquet("spark-warehouse/ccindexwordcount")
for row in df.sort(df.val.desc()).take(10): print("%6i\t%6i\t%s" % (row['val']['tf'], row['val']['df'], row['key']))
...
245 8 hd
154 10 the
97 8 movies
76 10 of
71 8 2019
69 10 and
64 10 to
62 10 online
62 2 football
61 10 free
for row in df.filter(df['key'].contains('ð')).take(10): print("%6i\t%6i\t%s" % (row['val']['tf'], row['val']['df'], row['key']))
...
1 1 tónleikaferð
1 1 annað
1 1 aðalmynd
2 2 sláðu
6 2 vönduð
1 1 viðeigandi
5 2 iðnó
2 2 með
2 2 lesið
3 2 jólastuði
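For readers without a Spark setup, the two queries above can be mimicked in plain Python. This is only a rough analogue, not the Spark code itself: it assumes that sorting on the `val` struct orders rows by `tf` first and then by `df`, and it uses a handful of rows copied from the output above. The names `Row`, `top_terms`, `terms_containing` and `format_row` are made up for this sketch.

```python
from collections import namedtuple

# One record per term: the key plus the term and document frequencies.
Row = namedtuple("Row", ["key", "tf", "df"])

# A few rows copied from the output shown above, in arbitrary order.
rows = [
    Row("the", 154, 10),
    Row("hd", 245, 8),
    Row("með", 2, 2),
    Row("movies", 97, 8),
    Row("iðnó", 5, 2),
    Row("of", 76, 10),
]


def top_terms(rows, n):
    """Analogue of df.sort(df.val.desc()).take(n), assuming (tf, df) ordering."""
    return sorted(rows, key=lambda r: (r.tf, r.df), reverse=True)[:n]


def terms_containing(rows, substring):
    """Analogue of df.filter(df['key'].contains(substring))."""
    return [r for r in rows if substring in r.key]


def format_row(r):
    """Same format string as in the loops above."""
    return "%6i\t%6i\t%s" % (r.tf, r.df, r.key)


for r in top_terms(rows, 3):
    print(format_row(r))
for r in terms_containing(rows, "ð"):
    print(format_row(r))
```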
Thank you for the reply @sebastian-nagel. Athena seems much faster, so I'll just keep using it. I hope someone finds this thread helpful.
It would have been helpful if there were some command examples for each .py file.
Or am I just not finding them?
For now, I need to read every line of code to understand the examples.
Still, I appreciate the examples; it would be much harder without them.