Some questions regarding Zipf's Law #36
Comments
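I see that your question got deleted on Stack Overflow before I could get to it; sorry about that. 😕
To use row_number() to get rank, you need to make sure that your data frame is ordered by n, the number of times a word is used in a document. Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data that is similar to a DTM from quanteda.)
library(tidyverse)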
library(tidytext)
data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)
ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#> document term count
#> <chr> <chr> <dbl>
#> 1 1789-Washington fellow 3
#> 2 1793-Washington fellow 1
#> 3 1797-Adams fellow 3
#> 4 1801-Jefferson fellow 7
#> 5 1805-Jefferson fellow 8
#> 6 1809-Madison fellow 1
#> 7 1813-Madison fellow 1
#> 8 1817-Monroe fellow 6
#> 9 1821-Monroe fellow 10
#> 10 1825-Adams fellow 3
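#> # ... with 44,715 more rows
Notice that here, you have a tidy data frame with one word per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, it wouldn't be meaningful because the words are all jumbled up in order.
Instead, we can arrange this by descending count.
ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count))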
ap_td
#> # A tibble: 44,725 x 3
#> # Groups: document [58]
#> document term count
#> <chr> <chr> <dbl>
#> 1 1841-Harrison the 829
#> 2 1841-Harrison of 604
#> 3 1909-Taft the 486
#> 4 1841-Harrison , 407
#> 5 1845-Polk the 397
#> 6 1821-Monroe the 360
#> 7 1889-Harrison the 360
#> 8 1897-McKinley the 345
#> 9 1841-Harrison to 318
#> 10 1881-Garfield the 317
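#> # ... with 44,715 more rows
Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it.
ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number(),
         total = sum(count),
         `term frequency` = count / total)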
ap_td
#> # A tibble: 44,725 x 6
#> # Groups: document [58]
#> document term count rank total `term frequency`
#> <chr> <chr> <dbl> <int> <dbl> <dbl>
#> 1 1841-Harrison the 829 1 9178 0.09032469
#> 2 1841-Harrison of 604 2 9178 0.06580954
#> 3 1909-Taft the 486 1 5844 0.08316222
#> 4 1841-Harrison , 407 3 9178 0.04434517
#> 5 1845-Polk the 397 1 5211 0.07618499
#> 6 1821-Monroe the 360 1 4898 0.07349939
#> 7 1889-Harrison the 360 1 4744 0.07588533
#> 8 1897-McKinley the 345 1 4383 0.07871321
#> 9 1841-Harrison to 318 4 9178 0.03464807
#> 10 1881-Garfield the 317 1 3240 0.09783951
#> # ... with 44,715 more rows
ap_td %>%
  ggplot(aes(rank, `term frequency`, color = document)) +
  geom_line(alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()
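To make the ordering point concrete, here is a minimal sketch on made-up data (the `toy` tibble and its counts are hypothetical, not taken from the inaugural corpus). It compares row_number() on unsorted rows with row_number() after arranging by descending count.

```r
library(dplyr)

# Hypothetical toy counts, deliberately not sorted by count
toy <- tibble(
  document = c("a", "a", "a", "b", "b"),
  term     = c("fellow", "the", "of", "citizens", "the"),
  count    = c(3, 10, 7, 2, 9)
)

# Without sorting, row_number() only numbers rows in their existing order
unsorted <- toy %>%
  group_by(document) %>%
  mutate(rank = row_number()) %>%
  ungroup()

# Sorting by descending count first turns row_number() into a real rank
ranked <- toy %>%
  group_by(document) %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number()) %>%
  ungroup()

ranked
```

In `unsorted`, "the" in document "a" gets rank 2 only because it happens to sit in the second row; in `ranked` it gets rank 1 in both documents. If ties should share a rank, dplyr's min_rank(desc(count)) is an alternative that does not depend on row order.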
Dear Julia
I am not sure whether this email will reach you, but I would
like to say thank you very much for your reply.
I deleted the post on the forum and reposted it yesterday.
Thank you
Bun
2017-08-05 20:51 GMT+01:00 Julia Silge <notifications@github.com>:
… I see that your question got deleted on Stack Overflow before I could get
to it; sorry about that. 😕
To use row_number() to get rank, you need to make sure that your data
frame is ordered by n, the number of times a word is used in a document.
Let's look at an example. It sounds like you are starting with a
document-term matrix that you are tidying? (I'm going to use some example
data that is similar to a DTM from quanteda.)
library(tidyverse)
library(tidytext)
data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)
ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#> document term count
#> <chr> <chr> <dbl>
#> 1 1789-Washington fellow 3
#> 2 1793-Washington fellow 1
#> 3 1797-Adams fellow 3
#> 4 1801-Jefferson fellow 7
#> 5 1805-Jefferson fellow 8
#> 6 1809-Madison fellow 1
#> 7 1813-Madison fellow 1
#> 8 1817-Monroe fellow 6
#> 9 1821-Monroe fellow 10
#> 10 1825-Adams fellow 3
#> # ... with 44,715 more rows
Notice that here, you have a tidy data frame with one word per row, but it
is not ordered by count, the number of times that each word was used in
each document. If we used row_number() here to try to assign rank, it
isn't meaningful because the words are all jumbled up in order.
Instead, we can arrange this by descending count.
ap_td <- tidy(inaug_dfm) %>%
group_by(document) %>%
arrange(desc(count))
ap_td
#> # A tibble: 44,725 x 3
#> # Groups: document [58]
#> document term count
#> <chr> <chr> <dbl>
#> 1 1841-Harrison the 829
#> 2 1841-Harrison of 604
#> 3 1909-Taft the 486
#> 4 1841-Harrison , 407
#> 5 1845-Polk the 397
#> 6 1821-Monroe the 360
#> 7 1889-Harrison the 360
#> 8 1897-McKinley the 345
#> 9 1841-Harrison to 318
#> 10 1881-Garfield the 317
#> # ... with 44,715 more rows
*Now* we can use row_number() to get rank, because the data frame is
actually ranked/arranged/ordered/sorted/however you want to say it.
ap_td <- tidy(inaug_dfm) %>%
group_by(document) %>%
arrange(desc(count)) %>%
mutate(rank = row_number(),
total = sum(count),
`term frequency` = count / total)
ap_td
#> # A tibble: 44,725 x 6
#> # Groups: document [58]
#> document term count rank total `term frequency`
#> <chr> <chr> <dbl> <int> <dbl> <dbl>
#> 1 1841-Harrison the 829 1 9178 0.09032469
#> 2 1841-Harrison of 604 2 9178 0.06580954
#> 3 1909-Taft the 486 1 5844 0.08316222
#> 4 1841-Harrison , 407 3 9178 0.04434517
#> 5 1845-Polk the 397 1 5211 0.07618499
#> 6 1821-Monroe the 360 1 4898 0.07349939
#> 7 1889-Harrison the 360 1 4744 0.07588533
#> 8 1897-McKinley the 345 1 4383 0.07871321
#> 9 1841-Harrison to 318 4 9178 0.03464807
#> 10 1881-Garfield the 317 1 3240 0.09783951
#> # ... with 44,715 more rows
ap_td %>%
ggplot(aes(rank, `term frequency`, color = document)) +
geom_line(alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
<https://camo.githubusercontent.com/82d975bc8c3ace9f742460bf0372c3c45bd59b27/687474703a2f2f692e696d6775722e636f6d2f4c6638545332642e706e67>
Ah, I just saw that! I'll post the answer there too so people can see it.
Thank you very much! I am studying your answer and trying to replicate Zipf's law.
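One way to check a replication, sketched here on synthetic data rather than the inaugural corpus, is to fit a straight line to log10(term frequency) against log10(rank) and look at the slope. The `zipf_toy` data frame and its 0.1 constant are made up for illustration, and using base R's lm() is an assumption of this sketch, not something used earlier in this thread.

```r
# Hypothetical data that follows an exact power law: tf = 0.1 / rank
zipf_toy <- data.frame(rank = 1:100)
zipf_toy$term_frequency <- 0.1 / zipf_toy$rank

# Linear fit in log-log space; the slope estimates the Zipf exponent
fit <- lm(log10(term_frequency) ~ log10(rank), data = zipf_toy)
slope <- unname(coef(fit)[2])
intercept <- unname(coef(fit)[1])
slope
```

Because the toy data is an exact power law, the fitted slope comes out at -1 up to floating-point error. With real word counts the slope is typically only near -1, so treat the fitted value as a rough diagnostic rather than a pass/fail test.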
I tried the code from http://tidytextmining.com/tfidf.html.
My question is: how can I rewrite the code to produce the inverse relationship between the log of term frequency and the log of rank?
The following is the term-document matrix. Any comments are highly appreciated.
I plotted the graph and it does look like the one in this thread.
https://i.stack.imgur.com/j2CTf.jpg
Thank you
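For reference, the inverse relationship asked about here is Zipf's law in its usual form. A hedged sketch, where $f$ is term frequency, $r$ is rank, and $C$ and $s$ are constants introduced only for illustration (the classic statement has $s \approx 1$):

```latex
f(r) = \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log_{10} f(r) = \log_{10} C - s \,\log_{10} r
```

On log-log axes the rank/frequency curve is therefore roughly a straight line with slope $-s$, which is what plotting rank against term frequency with scale_x_log10() and scale_y_log10() is meant to show.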