# Masking the data pollution

## List of things to mask:
- Political ideologies
- political parties
- Voting behaviour
- high-profile politicians
- political organizations movements
- References to subreddits
- Social media handles
- Links
- Gender identity
- Age
- Race
- Religion
- Nationality


In [41]:
import re
import geonamescache

strategy = 0

# -------------------------------
# Regex patterns
# -------------------------------

# Ideology/party keywords
regex_keywords_pattern = r"\b(?:" \
    r"cons|conser|conserv|conservative|" \
    r"lib|liber|liberal|liberter|" \
    r"prog|progressive|" \
    r"leftist|lefty|left[-\s]?wing|" \
    r"righty|right[-\s]?wing|" \
    r"far[-\s]?(left|right)|alt[-\s]?(right|left)|" \
    r"libertarian|centrist|moderate|socialist|anarchist|communist|marxist|" \
    r"dem|demo|democrat|democrats|democratic|" \
    r"repub|republican|republicans|" \
    r"gop|dnc" \
    r")\b"

regex_keywords = re.compile(regex_keywords_pattern)

# Political figures (with optional first names)
regex_figures_pattern = r"\b(?:" \
    r"(?:donald )?trump|" \
    r"(?:joe )?biden|" \
    r"(?:barack )?obama|" \
    r"(?:george w\. )?bush|" \
    r"(?:bill )?clinton|" \
    r"(?:hillary )?clinton|" \
    r"(?:kamala )?harris|" \
    r"(?:mike )?pence|" \
    r"(?:bernie )?sanders|" \
    r"(?:alexandria ocasio-)?cortez|aoc|" \
    r"(?:nancy )?pelosi|" \
    r"(?:mitch )?mcconnell|" \
    r"(?:boris )?johnson|" \
    r"(?:theresa )?may|" \
    r"(?:rishi )?sunak|" \
    r"(?:keir )?starmer|" \
    r"(?:vladimir )?putin|" \
    r"(?:xi )?jinping|" \
    r"(?:narendra )?modi|" \
    r"(?:emmanuel )?macron|" \
    r"(?:angela )?merkel|" \
    r"(?:justin )?trudeau|" \
    r"(?:jair )?bolsonaro|" \
    r"(?:recep tayyip )?erdogan|" \
    r"(?:scott )?morrison|" \
    r"(?:jacinda )?ardern" \
    r")\b"

regex_figures = re.compile(regex_figures_pattern)

# Other political words / phrases
regex_political_words_pattern = r"\b(?:" \
            r"abortion|guns|immigration|taxes|healthcare|" \
            r"medicare|medicaid|climate|environment|protest|" \
            r"democracy|socialism|capitalism|freedom|rights|" \
            r"marriage|equality|vote|voting|election|" \
            r"police|crime|war|military|maga|woke|" \
            r"libtard|snowflake|commie|social justice warrior|sjw|" \
            r"redneck|redpill|gay|trans|queer|immigrant" \
            r"beta male|soyboy|cuck|karen|" \
            r"FBI|FSB|KGB|ICE|CIA|IDF" \
            r")\b"

regex_political_words = re.compile(regex_political_words_pattern)

# -------------------------------
# Masking function
# -------------------------------

def mask_string(text: str) -> str:
    if not isinstance(text, str):
        return text

    # Strategy 1+: remove ideology keywords
    if strategy >= 1:
        text = regex_keywords.sub("", text)

    # Strategy 2+: remove political figures
    if strategy >= 2:
        text = regex_figures.sub("", text)

    # Strategy 3+: remove political words / phrases
    if strategy >= 3:
        text = regex_political_words.sub("", text)

    # Strategy 4+: remove locations
    if strategy >= 4:
        gc = geonamescache.GeonamesCache()
        countries = set([c['name'].lower() for c in gc.get_countries().values()])
        cities = set([c['name'].lower() for c in gc.get_cities().values()])
        tokens = text.split()
        tokens = [w for w in tokens if w.lower() not in countries and w.lower() not in cities]
        text = " ".join(tokens)

    # Remove extra spaces left by deletions
    text = re.sub(r"\s+", " ", text).strip()

    return text


In [43]:
import nltk

strategy = 4

string = "has already decided that Russia will be at war. It wasnâ€™t disclosed in the previous deal, thatâ€™s the issue. Iran wasnâ€™t negotiating in good faith then, why should we trust them now? Pretty much, this is their â€œAâ€ category. As in their standing army units who are supposed to be the most well prepared and manned. What their less equipped â€œBâ€ and â€œCâ€ units may have leaves a lot of questions up in the air now. Yeah if we didnâ€™t eliminate the PRC from the map during Korea, I doubt a small expeditionary force to Ukraine would cause too much tension. Look at it this way, say in the near future says that Poland is rightfully part of Russia because of its incorporation into the Russian Empire. It may be a NATO member, but then Putin starts rattling his nuclear saber. Would we really risk a nuclear war over little old Poland? Thatâ€™s the issue down the road, that the threat of nukes can just keep making the west shy away from confrontation. The fact they had a victory announcement already written up and posted during the start of the invasion gives you an idea of what they were expecting to happen with Ukraine. Russian facebook/twitter. Mostly Facebook though. If you listen to reddit or certain media sources it does sound like their portion is much larger than in reality. Anti-imperialist is a fancy phrase to say that they kowtow to Russia and China. Sometimes with a dose of Iran and Venezuela love too. url May be paywalled though. I remember reading articles about how in a way, Putinâ€™s unpredictable nature is an advantage against the west. People fear the threat of nukes probably more than the actual chances of it happening. And of course it makes it an easy out when it comes to the topic of a western expeditionary force. Not even a NATO one even. Thatâ€™s what I believe, we flip flop so much on the economy itâ€™s a ridiculous double standard. It was Obamaâ€™s economy when things were good, then it became Trumpâ€™s economy when things went bad. And now so quickly it flips between Trump and Bidenâ€™s economy depending on who you need to blame/love at the time. The goal is likely to keep Western citizens from agitating for an expeditionary force. The perceived threat of nuclear war, coupled with the idea that NATO, in terms of being a defensive alliance for member states, is inherently the only force at play against Russia means that it's likely enough to stop any intervention in Ukraine beyond logistical support publicly. Last I checked nothing confirmed per se, more of a tightening noose than an actual encirclement from what Iâ€™ve heard. As a Republican I also want this one to just not be on our side. Sheâ€™s like AOC, absolutely horrible for appearing sane to the country at large. Loosing control over the Mykolaiv front and now Kherson, looks like theyâ€™re overextended badly in the south west. That was fairly well known, they really didnâ€™t mobilize well despite the obvious build up by Russia. Theyâ€™re doing decently now, but thatâ€™s also closing the gate after the horse escaped in a way. Thatâ€™s pretty much the easiest solution, even back in Roman Judea. If you had a rebellion you killed and then enslaved the surviving enemy population until they could no longer effectively resist. In the West, we just donâ€™t want to admit that it can work, especially compared to â€œHearts and Mindsâ€. If I remember correctly, they sent in twenty generals to boost morale. Because the vast majority of European countries are arguably about as modernized and efficient as the Russian army is. Saudi Arabia is too busy emulating Russia to end the conflict definitely in Yemen. Real issue is if Russia will accept that as a ceasefire. Because it would mean Ukraine is no conflict, and therefore has no other issues with joining NATO. At this point when Mariupol falls, is it even expected to make much of a difference? They still need a large force to garrison the city and likely are going to suffer the same supply issues if they push west. Suddenly this makes a lot of sense considering Pikamee's known likes. We have the manufacturing capabilities for what is necessary to conduct a war, cut off China's oil supplies, and keep people from starving. That's all that's really necessary. It's definitely something to keep in mind, especially since there's the traditional Chinese medicine crowd to take into consideration. And so they'll be useful being repurposed for the only thing that really matters in the end, winning the war. Edit: And more importantly that little bit about debt is a very not rooted in traditional monetary policy. Because that's very much misunderstanding the nature of a government bond. url Just going to respond to that 2028 mark, but already its been pushed back in some forecasts. Just because it is growing in one aspect doesnâ€™t mean it is that strong in important aspects. Thinking a bigger economic number has to correlate positively with a countries ability to wage war or actually overtake the US in a meaningful way doesnâ€™t reflect the reality of a situation. url Notice how whenever you comment on China and say they have issues you end up with about 500 people coming in defending China and proclaiming superiority over any odd thing, while saying trying to go towards a war footing is ridiculous? Nope, the half Windsor is extremely versatile and suits many different occasions. Maybe you'll need a full one, but overall the half will get you through pretty much any event. Honestly, try Newegg. They are indeed starting to fall back down to msrpish prices. Of course it also must be mentioned that if no one was selling them at the de jure msrp in the first place, then that's not even the msrp. Yeah, or just use an AMEX card specifically. They're in general quite generous. True, theyâ€™re handing out nagantâ€™s of all guns out now. That likely illustrates a rather poor pool of equipment. Changing the law would have done nothing considering how the guns he used were stollen in the first place. People donâ€™t like to be told that they ought to tighten their own beltâ€™s. So despite the fact that we really ought to divest consumer electronics from China, telling the consumer â€œnoâ€ isnâ€™t exactly popular. There better be a nice hot cup of Wilkins Coffee in the movie. It looks like it's finally happening, the gpu market is finally correcting itself back to a normalish level of supply and demand. If he were supposedly a nazi weâ€™d already have had a Wannsee Conference on how exactly we want to exterminate them all. Besides, thereâ€™s only one side that thinks teaching second graders that kind of relationship should be permissible instead of waiting until actual puberty. Hopefully her monkey of a friend is helping her out. Obama was the one who wanted to do a â€œRussian resetâ€ and even did a whole publicity stunt with a real reset button. For a long time the Democratic party wanted cozy relations with Russia. Calling those dressed as nazis â€œSome jackass doing this on the streetâ€ is a pretty straightforward response in my opinion. Hypothetically if it passes, what is the possibility that we start seeing strict scrutiny applied to Title Ix lawsuits in general? So long as heâ€™s alive he canâ€™t be replaced. But he also has to be present to vote for something. Which likely reduces the Democratic majority to even more ineffectiveness, even if all the remaining senators are on the same page for once. Yeah, she really interacts with her fans consistently. If she even streams a game for an hour she'll probably do 2-3 hours of chatting. And more importantly, her responses aren't necessarily tied into whether you superchat her or not. Although I must say, I am a bit surprised that a certain someone wasn't in the top 10. I would possibly argue that in Rushiaâ€™s case, itâ€™s not a full on one way parasocial relationship. Because from how Rushia has interacted with fans and even with people she has met in real life, her personality is for lack of a better way to describe it, clingy. She really does seem to enjoy chatting in general with her chat in a way thatâ€™s a bit different from a lot of Niji and Holo girls. Still canâ€™t beat out Tamaki though, but I donâ€™t think anyone knows how to handle that cat really well. People do realize that 12 weeks is very much in line with the European average right? Plus it includes the major exceptions involving rape, danger to life of the mother, and incest. If we were supposedly that imperialist, than we should have had no issues with large scale reprisals against the Somali populace to encourage compliance. Personally I think we will see a large amount of this craziness"
string = string.lower()
nltk.line_tokenize(string)
print(mask_string(string))

print(len(string))
print(len(mask_string(string)))

has already decided that will be at . it wasnâ€™t disclosed in the previous deal, thatâ€™s the issue. wasnâ€™t negotiating in good faith then, why should we trust them now? pretty much, this is their â€œaâ€ category. as in their standing army units who are supposed to be the well prepared and manned. what their less equipped â€œbâ€ and â€œcâ€ units have leaves a lot questions up in the air now. yeah if we didnâ€™t eliminate the prc from the map during korea, i doubt a small expeditionary force to would cause too tension. look at it this way, in the near future says that is rightfully part because its incorporation into the russian empire. it be a nato member, but then starts rattling his nuclear saber. would we really risk a nuclear over little old poland? thatâ€™s the issue down the road, that the threat nukes can just keep making the west shy away from confrontation. the fact they had a victory announcement already written up and posted during the start the invasion gives you an i