# Urban vs. rural, predictive ngrams

How does language use on Twitter vary between urban and rural areas?

There are many ways to approach questions like this about linguistic variation. This notebook takes the simplest possible route, to get a rough sense of how large the difference is and what it consists of.

- Take two buckets of tweets, one from cities with population < 10,000 (rural) and one from cities with population > 1M (urban), and fit bag-of-words logistic regressions to tell them apart.
- Then, pull out the ngrams with the most extreme coefficients under the models: the most negative are most predictive of rural (class 0), the most positive of urban (class 1).
- As a bonus, the performance of the models gives a sense of the size / consistency of the difference: the degree to which (very) simple classifiers are able to tell urban from rural.

The data:

- 2017 decahose through September, ~5 billion tweets (excluding RTs).
- Geocoded tweets (city or state): **357,061,859**
- Tweets matched to individual cities: **250,701,213**
- Unique users for geocoded tweets: **12,709,671**
- Tweets matched to cities with population < 10,000 (rural): **18,162,300**
- Tweets matched to cities with population > 1M (urban): **51,694,990**
- Working samples: **~5M** each for urban and rural.
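To put these counts in proportion, here is a quick calculation that uses only the figures listed above. The variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the list above; variable names are illustrative.
geocoded = 357_061_859
city_matched = 250_701_213
rural_total = 18_162_300
urban_total = 51_694_990
sample_size = 5_000_000  # approximate working sample per class

city_share = city_matched / geocoded       # geocoded tweets matched to a city
rural_share = rural_total / city_matched   # city-matched tweets that are rural
urban_share = urban_total / city_matched   # city-matched tweets that are urban
rural_sampled = sample_size / rural_total  # share of the rural bucket sampled
urban_sampled = sample_size / urban_total  # share of the urban bucket sampled

for label, value in [
    ('city-matched / geocoded', city_share),
    ('rural / city-matched', rural_share),
    ('urban / city-matched', urban_share),
    ('rural sample / rural bucket', rural_sampled),
    ('urban sample / urban bucket', urban_sampled),
]:
    print(f'{label}: {value:.1%}')
```

Roughly 70% of geocoded tweets match a city, and the rural and urban buckets make up about 7% and 21% of those, so the 5M working samples cover a much larger share of the rural bucket than of the urban one.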

TODO: Convert to Spark MLlib and run on EC2 with full data?

In [2]:
import ujson
import numpy as np
import re
import os

from tqdm import tqdm_notebook
from glob import glob

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [3]:
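def read_tweets(pattern, fcount=None):
    """Load tweet bodies from the JSON-lines files matching `pattern`.

    If `fcount` is set, only the first `fcount` files are read. Hashtags,
    mentions, and links are stripped from each tweet.
    """
    paths = glob(pattern)

    if fcount:
        paths = paths[:fcount]

    tweets = []
    for path in paths:

        with open(path) as fh:
            for line in fh:
                tweet = ujson.loads(line)
                # Drop '#', '@', or 'http' and everything up to the next whitespace.
                text = re.sub(r'(#|@|http)\S+', '', tweet['body'])
                tweets.append(text)

    return tweets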
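To make the cleaning step concrete, here is a small sketch of what the substitution in `read_tweets` does to a single tweet. The helper name `strip_entities` and the example text are made up for illustration.

```python
import re

# Same pattern as in read_tweets, written as a raw string.
ENTITY_PATTERN = re.compile(r'(#|@|http)\S+')


def strip_entities(text):
    """Remove hashtags, mentions, and links (hypothetical helper name)."""
    return ENTITY_PATTERN.sub('', text)


example = 'Rain today in #Houston via @nws http://t.co/abc'
cleaned = strip_entities(example)
print(repr(cleaned))  # 'Rain today in  via  '
```

The leftover double spaces are harmless here, since the vectorizer splits on word boundaries. Note that the pattern is not anchored to the start of a word, so `a@b.com` would be cut down to `a`.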

In [4]:
rural = read_tweets('../../data/geo-lt10k.json/*.json', 60)

In [5]:
urban = read_tweets('../../data/geo-gt1m.json/*.json', 20)

In [6]:
rural_ = rural[:5000000]

In [7]:
urban_ = urban[:5000000]

In [8]:
X = rural_ + urban_

In [9]:
y = ([0] * len(rural_)) + ([1] * len(urban_))
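The slices above keep the first 5M tweets in whatever order the files were read. If a uniformly random subset were preferred, one option is a seeded draw. This is a sketch of an alternative, not what the notebook does, and `sample_tweets` and its `seed` parameter are hypothetical.

```python
import random


def sample_tweets(tweets, k, seed=0):
    """Return up to k tweets drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    k = min(k, len(tweets))    # never ask for more items than exist
    return rng.sample(tweets, k)


# Toy stand-in for the real tweet lists.
toy = [f'tweet {i}' for i in range(10)]
subset = sample_tweets(toy, 3)
print(subset)
```

With the real data this would be called as `sample_tweets(rural, 5000000)` and `sample_tweets(urban, 5000000)`.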

In [10]:
def train_model(X, y, ngram_range=(1, 1), max_features=1000):
    """Fit a bag-of-ngrams logistic regression and print a held-out report."""

    # Count ngrams, keeping only the `max_features` most frequent ones.
    cv = CountVectorizer(
        ngram_range=ngram_range,
        max_features=max_features,
    )

    X = cv.fit_transform(X)

    # Default split: 75% train, 25% test.
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    report = classification_report(
        y_test, y_pred, target_names=('rural', 'urban'),
    )

    print(report)

    return cv, model
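The ngrams that `CountVectorizer` produces depend on its default tokenizer, which lowercases the text and keeps only tokens of two or more word characters. The sketch below approximates that behaviour with the standard library for a single ngram size; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

# CountVectorizer's default token pattern: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def word_ngrams(text, n):
    """Approximate CountVectorizer's word ngrams for a single size n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams('I wish I could', 2))
# ['wish could']
print(word_ngrams('Enter for a chance to win', 2))
# ['enter for', 'for chance', 'chance to', 'to win']
```

Single-character words such as "I" and "a" are dropped before the ngrams are built, which is why features like `wish could` and `for chance` show up in the bigram lists.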

In [11]:
def print_rural(model, names, n=100):
    # Most negative coefficients first: ngrams most predictive of rural (class 0).
    idxs = model.coef_[0].argsort()
    for idx in idxs[:n]:
        print(model.coef_[0][idx], names[idx])

In [12]:
def print_urban(model, names, n=100):
    # Most positive coefficients first: ngrams most predictive of urban (class 1).
    idxs = np.flip(model.coef_[0].argsort(), 0)
    for idx in idxs[:n]:
        print(model.coef_[0][idx], names[idx])
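The two printing helpers differ only in sort direction. As an illustration of the ranking logic, here is a standard-library sketch that merges them into one function. `top_ngrams` is a hypothetical name, plain lists stand in for `model.coef_[0]` and the feature names, and the coefficients are toy values rather than model output.

```python
def top_ngrams(coefs, names, n=100, urban=True):
    """Return (coefficient, ngram) pairs, most extreme first.

    urban=True ranks the most positive coefficients first (class 1);
    urban=False ranks the most negative first (class 0, rural).
    """
    pairs = sorted(zip(coefs, names), reverse=urban)
    return pairs[:n]


# Toy values for illustration only; not taken from the fitted models.
coefs = [-1.4, 1.8, 0.3, -0.7]
names = ['temp', 'chicago', 'music', 'weather']

print(top_ngrams(coefs, names, n=2, urban=False))
# [(-1.4, 'temp'), (-0.7, 'weather')]
print(top_ngrams(coefs, names, n=2, urban=True))
# [(1.8, 'chicago'), (0.3, 'music')]
```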

In [13]:
ng1_cv, ng1_model = train_model(X, y, (1, 1))

             precision    recall  f1-score   support

      rural       0.55      0.63      0.59   1250407
      urban       0.57      0.49      0.53   1249593

avg / total       0.56      0.56      0.56   2500000



In [14]:
ng2_cv, ng2_model = train_model(X, y, (2, 2))

             precision    recall  f1-score   support

      rural       0.55      0.27      0.36   1249919
      urban       0.52      0.79      0.62   1250081

avg / total       0.54      0.53      0.49   2500000



In [15]:
ng3_cv, ng3_model = train_model(X, y, (3, 3))

             precision    recall  f1-score   support

      rural       0.58      0.11      0.18   1248368
      urban       0.51      0.92      0.66   1251632

avg / total       0.54      0.51      0.42   2500000
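
As a rough cross-check on how far each model is from chance, overall accuracy can be recovered from the per-class recall and support in the reports above. The numbers are copied from those reports; the helper name `accuracy_from_report` is illustrative.

```python
# (recall, support) per class, copied from the three reports above.
reports = {
    'unigram': {'rural': (0.63, 1250407), 'urban': (0.49, 1249593)},
    'bigram': {'rural': (0.27, 1249919), 'urban': (0.79, 1250081)},
    'trigram': {'rural': (0.11, 1248368), 'urban': (0.92, 1251632)},
}


def accuracy_from_report(rows):
    """Accuracy = correctly classified examples / all examples."""
    correct = sum(recall * support for recall, support in rows.values())
    total = sum(support for _, support in rows.values())
    return correct / total


accuracies = {name: accuracy_from_report(rows) for name, rows in reports.items()}
for name, acc in accuracies.items():
    print(f'{name}: {acc:.3f}')
```

Because the recalls are rounded to two decimals, the results are approximate: about 0.56, 0.53, and 0.52 for unigrams, bigrams, and trigrams, against a chance level of 0.50 for balanced classes.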



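Why does recall diverge so much in the bigram / trigram models? Precision stays between 0.51 and 0.58 for both classes, but the trigram model recovers only 11% of rural tweets versus 92% of urban ones.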
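
One possible explanation for the lopsided per-class scores, not verified here, is feature coverage. With `max_features=1000`, longer ngrams are sparser, so more tweets contain none of the selected features. For those all-zero rows the prediction depends only on the intercept, so they all land in the same class. The sketch below measures coverage on toy documents with the standard library; `word_ngrams` and `coverage` are hypothetical helpers, and ties in the top-k selection may be broken differently than in scikit-learn.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def word_ngrams(text, n):
    """Word ngrams of size n over lowercase tokens of 2+ characters."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(docs, n, max_features):
    """Fraction of docs containing at least one of the top ngrams."""
    doc_ngrams = [word_ngrams(doc, n) for doc in docs]
    counts = Counter(g for grams in doc_ngrams for g in grams)
    vocab = {g for g, _ in counts.most_common(max_features)}
    covered = sum(1 for grams in doc_ngrams if any(g in vocab for g in grams))
    return covered / len(docs)


# Toy documents for illustration only.
docs = [
    'rain today in the county',
    'rain today and wind',
    'new video out now',
    'good luck boys',
]
for n in (1, 2, 3):
    print(n, coverage(docs, n, max_features=2))
```

On the real data, a comparable check would be the share of all-zero rows in the vectorized matrix, for example `(X.getnnz(axis=1) == 0).mean()` on the sparse output of `fit_transform`.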

# Rural unigrams

In [16]:
names1 = ng1_cv.get_feature_names()

In [17]:
print_rural(ng1_model, names1)

-1.37580556409 temp
-1.3671574219 mph
-1.24361257582 county
-1.11023594242 wind
-0.787436406853 weather
-0.758813997341 pm
-0.664389617678 automatically
-0.634613892563 entered
-0.632820245824 baseball
-0.61283075076 congratulations
-0.557181794553 rain
-0.54778941848 giveaway
-0.546180435827 ny
-0.528216585138 lady
-0.523778228868 liked
-0.510737720275 boys
-0.500192621914 lt
-0.496046233825 country
-0.491323817819 2nd
-0.471499283259 3rd
-0.465936793378 profile
-0.462156108843 4th
-0.450937544246 road
-0.440495218116 checked
-0.438616598715 final
-0.43472148519 nude
-0.434681637198 students
-0.430910784915 الله
-0.426167401526 coach
-0.420598411281 basketball
-0.412829926875 beach
-0.410195781063 playlist
-0.403851843454 football
-0.399266331538 download
-0.391553449701 field
-0.387848472507 29
-0.379963665629 state
-0.377644800223 school
-0.372447634237 town
-0.370293657464 spring
-0.361544095174 awesome
-0.355634025907 lead
-0.354465733373 girls
-0.353943914804 luck
-0.352381616363

# Urban unigrams

In [18]:
print_urban(ng1_model, names1)

1.8116194838 chicago
1.55601772254 houston
1.46826214247 san
1.44349127406 ca
1.42542534037 wire
1.38673623386 dallas
1.32557432979 los
1.05984114917 tx
1.04915516571 temperature
0.920444390022 lmfao
0.76559828272 film
0.748695892638 la
0.643158959428 wild
0.634981434638 lmao
0.633441951796 dm
0.598402337303 humidity
0.585950021841 niggas
0.57534583995 smh
0.567615841444 york
0.545390621816 photos
0.518546303643 album
0.491720391638 issue
0.468084663853 email
0.465216527161 nigga
0.46216446722 feat
0.459724639932 listen
0.443708867777 street
0.436861352175 en
0.433018000269 tickets
0.429097183336 recommend
0.401334698224 stream
0.396191637574 texas
0.385879039747 current
0.374809476854 que
0.354763461047 el
0.351919412586 de
0.349809196014 unfollowed
0.343180802658 music
0.335404666876 yo
0.33080021409 ppl
0.32772163179 fit
0.318018585083 also
0.316306971089 opening
0.3160999123 beauty
0.314803513192 nah
0.314794024745 dick
0.314238069612 clean
0.314019962907 review
0.309719645379 blac

# Rural bigrams

In [19]:
names2 = ng2_cv.get_feature_names()

In [20]:
print_rural(ng2_model, names2)

-3.03141149995 rain today
-1.70316305015 earning in
-1.59871976695 appeared available
-1.1985602442 to facebook
-1.16688752908 lady gaga
-1.08666091173 lt gt
-1.05324943031 with download
-0.921339918831 automatically checked
-0.886917857369 now playing
-0.81944165033 high school
-0.795027995159 more for
-0.776355243902 for chance
-0.730563137823 for sale
-0.697597000821 earned the
-0.675504139965 available until
-0.658703450813 congrats to
-0.593889877963 good luck
-0.592915490244 miss you
-0.532874177786 to win
-0.459624526487 happy birthday
-0.459226297718 wish could
-0.457552400531 entered giveaway
-0.456569371954 great day
-0.453569911983 for great
-0.436294748455 end of
-0.436116622148 posted photo
-0.422811278518 just checked
-0.418350907052 me automatically
-0.414789036193 lt lt
-0.40513308365 best friend
-0.385699399603 liked video
-0.374950091857 to our
-0.374428196785 love you
-0.357370420906 we will
-0.357048861538 learn more
-0.315479049184 for our
-0.311220451706 the week


# Urban bigrams

In [21]:
print_urban(ng2_model, names2)

2.71043770356 has appeared
2.32441082849 los angeles
2.22837829392 san diego
2.0409289068 in ca
0.98384508656 great fit
0.831615045985 people unfollowed
0.822611440543 new video
0.798913491369 opening here
0.690112682524 new york
0.689211786036 person unfollowed
0.634544259128 to apply
0.506731079186 in on
0.506638620333 dm me
0.505592983669 follow me
0.48439410656 that shit
0.442870168154 person followed
0.427822830775 tune in
0.370379224994 this shit
0.362422214854 out here
0.36162219331 in 2017
0.358943147561 checked in
0.352283757292 now on
0.338007727621 people followed
0.332347159529 of her
0.318394935621 listen to
0.311888434359 the fuck
0.305015574542 which is
0.298701513568 our latest
0.297268876271 work in
0.292569280848 recommend anyone
0.283970976031 be like
0.283462341233 ways to
0.280927546513 my new
0.273634752463 social media
0.266564500525 my god
0.263958323259 the us
0.259481982974 re looking
0.258549809255 this was
0.253845127145 the future
0.252222133743 the show
0.

# Rural trigrams

In [22]:
names3 = ng3_cv.get_feature_names()

In [23]:
print_rural(ng3_model, names3)

-5.57083507864 on radio danz
-3.63054970132 wind mph barometer
-3.34518541373 rain today 00
-3.280014639 is on q106
-3.280014639 on q106 country
-3.1255147782 falls ny weather
-3.1255147782 honeoye falls ny
-3.06769921745 example twitter weather
-3.06769921745 twitter weather data
-3.01303028592 in steady temperature
-2.88289683454 wind speed 0mph
-2.58945415774 national weather service
-1.99928366957 today 00 in
-1.52307186509 more and more
-1.48913220091 left in the
-1.46045594102 just checked in
-1.39246784115 good luck to
-1.36559091991 and people unfollowed
-1.32264994605 new photo to
-1.25231886349 out my broadcast
-1.24434024686 to your profile
-1.23078147828 lt gt gt
-1.15512673819 cast your vote
-1.14720757242 unit name chet
-1.14720757242 motion detection unit
-1.14720757242 name chet home
-1.14720757242 home date time
-1.14720757242 chet home date
-1.14720757242 detection unit name
-1.10396502787 now available for
-1.09330066176 happy birthday hope
-1.07049714751 video to fa

# Urban trigrams

In [24]:
print_urban(ng3_model, names3)

4.90555537594 flight spotted at
4.7252154362 10 minute guide
4.53995590192 miles away traveling
3.52502750963 airlines flight spotted
3.23136182996 york city news
2.67151760388 work in ca
2.50413651048 in san diego
2.34591697654 in los angeles
2.1258235665 post your tweet
2.09473678644 surely interest you
2.09473678644 should surely interest
2.09473678644 this should surely
2.02063228519 your tweet also
2.02063228519 tweet also at
1.89277433479 can you recommend
1.51547029993 com gt gt
1.45381027246 click to apply
1.42684169082 00 humidity is
1.42684169082 gmt 0000 utc
1.42684169082 current temperature is
1.42684169082 0000 utc current
1.42684169082 utc current temperature
1.39531984685 you are looking
1.23361951049 followed me and
1.13122187042 new video to
1.11397250174 checked in at
1.06937498641 link in bio
0.935476410747 latest opening here
0.928509021713 in with download
0.867254586975 click for details
0.842969446075 00 in humidity
0.784023203742 the 10 minute
0.776872207472 con