# Regular Expressions
## Substitution and More

In [None]:
# This enables say, which prints its arguments followed by a newline
# say is an optional feature in Perl 5 and is standard in Perl 6
use feature qw(say);

## Substitution Basics
* In Perl the syntax for a substitution regex is ```s/regex/substitution/```
* The regex is the only part that can use metacharacters
    * The substitution can consist of literal characters or special variables
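
As a minimal sketch of the point above, the following shows that a metacharacter such as `.` only has special meaning in the pattern, while the replacement is treated like a double-quoted string. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical sample text, not taken from the examples in this notebook.
my $text = "abc axc";

# In the pattern, "." matches any character.
# In the replacement, "." is just a literal dot.
(my $dotted = $text) =~ s/a.c/a.c/g;
say $dotted;      # a.c a.c

# The replacement is interpolated like a double-quoted string,
# so ordinary variables can be used there too.
my $name = "Perl";
(my $greeting = "Hello, NAME") =~ s/NAME/$name/;
say $greeting;    # Hello, Perl
```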
   


In [60]:
$good = "This is a very simple string where everything is spaced nicely";
$bad = "This is   also  a      simple  string  but it has    very weird spacing";
$good =~ s/\s\s+//;
$bad =~ s/\s\s+/ /;
say $good;
say $bad;

This is a very simple string where everything is spaced nicely
This is also  a      simple  string  but it has    very weird spacing


1


## Simple Substitution using the g Modifier
* In most cases, we want to use substitution to substitute all matches, so we should use the g modifier

In [61]:
$bad = "This is   also  a      simple  string  but it has    very weird spacing";
$bad =~ s/\s\s+/ /g;
say $bad;

This is also a simple string but it has very weird spacing


1
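
A substitution also returns the number of replacements it made, which can be handy for checking how much cleanup was needed. A small sketch, using a made-up string:

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical sample with two runs of extra whitespace.
my $spaced = "a  b   c d";

# s///g returns the number of substitutions performed.
my $count = ($spaced =~ s/\s\s+/ /g);
say $count;     # 2
say $spaced;    # a b c d
```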


## Simple Substitution with Literals
* The pattern portion can consist of nothing but literals
    * Many languages now have a dedicated replace method or function for this kind of string operation
    * Literal substitution is still very useful in fast, simple tools like ```sed```

In [62]:
$umbc = "UMBC is located in MD";
$umbc =~ s/UMBC/The University of Maryland, Baltimore County/g;
say $umbc;
$umbc =~ s/MD/Maryland/g;
say $umbc;

The University of Maryland, Baltimore County is located in MD
The University of Maryland, Baltimore County is located in Maryland


1
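
When the literal text itself contains metacharacters, it has to be quoted so that it is matched as plain text. One way to do this in Perl is `\Q...\E`; the sketch below uses a made-up sentence and variable names.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical sample text; "C++" contains the metacharacter "+".
my $langs   = "I write C++ and Perl";
my $literal = "C++";

# \Q...\E quotes metacharacters, so the variable is matched as plain text.
$langs =~ s/\Q$literal\E/C plus plus/g;
say $langs;    # I write C plus plus and Perl
```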


## Backreference Variables
* Many common tasks, like reformatting, involve saving part of the match
    * To refer to a group found in the pattern, use \$x, where x is the group number

In [65]:
$today = "Today's date is 9-7-17";
$today =~ s/(\d?\d)-(\d?\d)-(\d\d)/$1\/$2\/$3/g;
say $today;

Today's date is 9/7/17


1


In [66]:
$today = "Today's date is 09-07-17";
$today =~ s/(\d?\d)-(\d?\d)-(\d\d)/$1\/$2\/$3/g;
say $today;

Today's date is 09/07/17


1


In [67]:
$today = "The University of Maryland beat the University of Texas last week in football.";
$today =~ s/University of (\w)\w+/U of $1/g;
say $today;

The U of M beat the U of T last week in football.


1
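
Numbered groups can get hard to read as patterns grow. Perl (5.10 and later) also supports named groups, read back through the `%+` hash. The sketch below repeats the date example; the names `month`, `day`, and `year` are an assumption about what each field means.

```perl
use strict;
use warnings;
use feature qw(say);

my $today = "Today's date is 09-07-17";

# (?<name>...) creates a named group; $+{name} refers back to it.
# The field names here are assumed, the notebook does not say which is which.
$today =~ s/(?<month>\d?\d)-(?<day>\d?\d)-(?<year>\d\d)/$+{month}\/$+{day}\/$+{year}/g;
say $today;    # Today's date is 09/07/17
```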


## Sidenote: Changing Delimiters
* When matching or substituting a string with the ```/``` character, it can be very annoying to escape all of them
* Almost any punctuation character can be used as the delimiter
    * If it is a character that comes in pairs, such as ```[``` and ```]```, use the left and right versions

In [68]:
$today = "Today's date is 09-07-17";
$today =~ s[(\d?\d)-(\d?\d)-(\d\d)][$1/$2/$3]g;
say $today;

Today's date is 09/07/17


1


In [69]:
$today = "Today's date is 09-07-17";
$today =~ s!(\d?\d)-(\d?\d)-(\d\d)!$1/$2/$3!g;
say $today;

Today's date is 09/07/17


1


## Lookahead and Lookbehind
* In some instances, we want to require that a piece of text is present without making it part of the match
* A lookahead is written as:

    ```(?=pattern)```

* A lookbehind is written as:

    ```(?<=pattern)```
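
Because lookarounds consume no characters, a substitution built only from them inserts text at a position without removing anything. A minimal sketch, using a made-up number, that adds thousands separators:

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical sample value.
my $number = "1234567";

# Match the empty position that has a digit behind it and
# a multiple of three digits ahead of it, up to a word boundary.
$number =~ s/(?<=\d)(?=(?:\d{3})+\b)/,/g;
say $number;    # 1,234,567
```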

## Lookahead and Lookbehind Practice
* Let's assume that in our text every 7-digit number is a phone number

In [70]:
$bad_number = "1234567";
$bad_number =~ s/(?<=\d\d\d)(?=\d\d\d\d)/-/g;
say $bad_number;

123-4567


1


## Accessing Matches
* Often we want to retrieve a specific part of the match
* We can do this by using groups, and then referring back to the group number later in the code
* In Perl this uses the same special variables found in substitution

In [71]:
$addr = "The address of UMBC is 1000 Hilltop Circle, Baltimore, MD.";
$addr =~ /(\d+ \w+ \w+, \w+, \w+)/;
say $1;

1000 Hilltop Circle, Baltimore, MD


1
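
The special variables keep their old values when a match fails, so it is worth checking that the match succeeded. One possible approach is to match in list context, where the groups are returned directly and a failed match returns an empty list. The variable names below are illustrative.

```perl
use strict;
use warnings;
use feature qw(say);

my $addr = "The address of UMBC is 1000 Hilltop Circle, Baltimore, MD.";

# In list context a match returns its captured groups.
my ($number, $street, $city, $state) =
    $addr =~ /(\d+) (\w+ \w+), (\w+), (\w+)/;
say "$street in $city, $state" if defined $street;    # Hilltop Circle in Baltimore, MD

# There is no five-digit run in the string, so this match fails
# and $zip is left undefined.
my ($zip) = $addr =~ /(\d{5})/;
say defined $zip ? $zip : "no ZIP code found";        # no ZIP code found
```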


## Accessing Matches with the g Modifier
* This will vary from language to language, but in Perl a while loop is used to keep extracting matches
* Let's extract all the Twitter handles from the following article: http://retrieverweekly.umbc.edu/fifteen-essential-twitter-accounts-umbc-students/


In [None]:
my $article = <<'END_MESSAGE';
UMBC is often criticized by its own students for a perceived lack of student life and for a campus that falls silent after 5 p.m. For a new student or a new resident, it can be challenging to explore, let alone locate, the campus culture.

The essential first step in immersing yourself in UMBC doesn’t even require leaving your bedroom. UMBC may get quiet in the evenings, but that’s when UMBC Twitter gets loud. Between professors, departments and students themselves, UMBC has a vibrant Twitter community. If you’re new to UMBC or you’re just looking to get a little more involved, Twitter is an excellent place to start. Below, in no particular order, are fifteen accounts you should start with. Give them a follow and you might find yourself feeling a little more connected to your university.

1. David Hoffman (@CoCreatorDavid)

David Hoffman is the assistant director of student life where he is responsible for encouraging civic agency among students. One of the masterminds behind the STRiVE leadership program, Hoffman values the active participation of students in campus life. Give him a follow if you’re looking to improve your political and social engagement.

2. UMBC Career Center (@UMBCcareers)

We’ve all heard about the Career Center — orientation leaders and tour guides love to mention it — but how many of us actually use any of the resources it provides? You don’t have to schedule a mock interview or bring your resume in for approval to take advantage of the career experts at UMBC. You should probably do those things, especially you second-semester seniors, but giving them a follow on Twitter is a good move too. Their tweets will give you interview tips, resume recommendations, and job and internship leads.

3. UMBC SGA (@umbcsga)

There has been a lot of skepticism about the effectiveness and the integrity of UMBC’s Student Government Association in recent years, and not without reason. For many students, the scandal and trial defined Anthony Jankoski’s 2015-16 presidency. As a result, students lost faith and interest in the SGA. If you’re looking to see a better side of SGA, this account is a must-follow. Presumably looking to improve their transparency, SGA has tweeted details about Senate meetings, swearings-in, and the best events on campus.

4. The Arts at UMBC (@ArtsAtUMBC)

Looking to check out the “explosive new choreography” or the “sparkling new Linehan concert hall”? There are so many arts and performance spaces on campus that you might not know about, and every weekend there’s something happening in all of them. Arts at UMBC will give you all the details, so start planning your weekend now.

5. UMBC Student Life (@UMBCStudentLife)

If your goal here is to get more involved in student events around campus, this is the best account on the list. Check their feed every morning as you’re waiting in the Starbucks line to see if anything exciting is happening that day. They’ll give you the hot tips on free food, free shirts and free crafts.

6. UMBC Bookstore (@umbcbookstore)

The Bookstore and the Yum Shoppe always have sales, giveaways and other events going on. Right now, you could win a mini-fridge stocked with Coke products and you can save 20 percent on a new backpack. You can also check out the one and only Bookstore Bob rapping about textbooks. What’s not to love?

7. UMBC Athletics (@UMBCAthletics)

We may not have a football team, but we have plenty of other teams for you to check out. They all have individual accounts, so if you’re just looking for updates on Men’s Basketball, you can follow their own account. This umbrella Athletics account will keep you updated on all the Retriever teams and their wins (or their losses).

8. CoCreate UMBC (@CoCreateUMBC)

This account is run by David Hoffman and Craig Berger, who are both on this list with their personal accounts. This account is harder to describe, because it’s sort of a jumble. That’s a good thing, though — this account will give you varied and diverse stories of student leaders, as well as reflections on the ‘state of the campus.’

9. UMBC Women’s Center (@womencenterUMBC)

“All are welcome as long as they respect women. Their experiences. Their stories. Their potential.” This is the motto of the Women’s Center, one of the most important and most socially active groups on campus. If you’d like to learn more about what they do and why they do it, this is the account to follow.

10. Craig Berger (@CraigBerger)

Craig Berger’s personal account will keep you politically informed. You will notice a strong liberal slant, but if that’s not a problem for you, Berger’s account will ensure that you’re up to date with current events, local and national.

11. True Grit (@TrueGrit66)

This is another one for the sports fans. Run by the person (people?) inside the True Grit mascot costume, this account is another great place to look for sports updates. Not to mention, you’ll scroll past more True Grit selfies than you’ll ever need. If you’re looking for more school spirit on a campus that doesn’t always have a lot, start with following True Grit.

12. UMBC Dining Services (@umbcdining)

Anyone who eats on campus should check out this account. They tweet about monthly specials, cajun food day at D-Hall and where to buy the best donuts. They also post some solid food trivia. Today is National Apple Pie Day? Thanks, UMBC Dining!

13. UMBC Orientation (@UMBCorientation)

You probably won’t want to follow this account unless you’re brand new to UMBC. It can be hard to settle in, especially in the Spring, when most of your fellow students have already figured out their way around. But if you’re looking for tips and answers to questions you’re afraid to ask, follow the Orientation team.

14. Renetta G. Tull (@Renetta_Tull)

Vice Provost of the UMBC Graduate School and director of PROMISE:AGEP, Renetta Tull is a force of nature. Follow her for glimpses into her academic research, her grad student seminars, and her thoughts on social progress.

15. University of Maryland, Baltimore County (@UMBC)

There’s not a lot to say about this one. If you’re on Twitter and you go to UMBC, following this account is a no-brainer. They’ll keep you up to speed with happenings on campus across all its departments.

16. The Retriever (@retrieverweekly);

We promised you 15 must-follow Twitters, but we’d be remiss not to mention The Retriever’s own account. It’ll give you absolutely everything you could ever want to know about UMBC. At least, we like to think so.

END_MESSAGE

In [75]:
while($article =~ /\(@(.+)\)/g){
    say $1;
}

CoCreatorDavid
UMBCcareers
umbcsga
ArtsAtUMBC
UMBCStudentLife
umbcbookstore
UMBCAthletics
CoCreateUMBC
womencenterUMBC
CraigBerger
TrueGrit66
umbcdining
UMBCorientation
Renetta_Tull
UMBC
retrieverweekly
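
As an alternative to the while loop, a match with the g modifier in list context returns every captured group at once. The sketch below uses a shortened three-line sample instead of the full article, and `\w+` in place of `.+` so that the capture cannot run past the handle.

```perl
use strict;
use warnings;
use feature qw(say);

# Shortened sample: three of the numbered lines from the article.
my $sample = <<'END_SAMPLE';
1. David Hoffman (@CoCreatorDavid)
2. UMBC Career Center (@UMBCcareers)
3. UMBC SGA (@umbcsga)
END_SAMPLE

# In list context, /g returns all captures from all matches.
my @handles = $sample =~ /\(@(\w+)\)/g;
say scalar(@handles);       # 3
say join ", ", @handles;    # CoCreatorDavid, UMBCcareers, umbcsga
```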


## Exercise
* How can we capture both the Twitter handle and the organization/person associated with it?
* Reminder, the lines we are interested in look like this: 

```1. David Hoffman (@CoCreatorDavid)```

In [98]:
while($article =~ /^\s*\d+\.\s+((\w+\s)+)\s*\(@(.+)\)/gm){
    say $1 . $3 ;
}

David Hoffman CoCreatorDavid
UMBC Career Center UMBCcareers
UMBC SGA umbcsga
The Arts at UMBC ArtsAtUMBC
UMBC Student Life UMBCStudentLife
UMBC Bookstore umbcbookstore
UMBC Athletics UMBCAthletics
CoCreate UMBC CoCreateUMBC
Craig Berger CraigBerger
True Grit TrueGrit66
UMBC Dining Services umbcdining
UMBC Orientation UMBCorientation
The Retriever retrieverweekly
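
The output above is missing three of the sixteen entries, because `(\w+\s)+` cannot match names that contain an apostrophe, a period, or a comma. One possible fix, sketched here on a shortened sample, is to capture the name lazily with `.+?` and let the parenthesized handle end it. The sample uses a plain ASCII apostrophe where the article has a typographic one.

```perl
use strict;
use warnings;
use feature qw(say);

# Shortened sample: the three entries skipped by the pattern above.
my $sample = <<'END_SAMPLE';
9. UMBC Women's Center (@womencenterUMBC)
14. Renetta G. Tull (@Renetta_Tull)
15. University of Maryland, Baltimore County (@UMBC)
END_SAMPLE

my @pairs;
# (.+?) takes as little as possible, stopping where " (@handle)" begins.
while ($sample =~ /^\s*\d+\.\s+(.+?)\s*\(@(\w+)\)/gm) {
    push @pairs, "$1: $2";
    say "$1: $2";
}
```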


## Splitting Strings
* Regular expressions allow strings to be split on a pattern rather than on a single fixed delimiter

In [101]:
$bad_csv_data = "Name,Phone Number,Email,a,list,of,websites,visited,Date";
@data = split /,(?=[A-Z])/, $bad_csv_data;
foreach $d (@data){
    if ($d =~ /,/){
        foreach $e (split /,/, $d, 2){
            say $e;
        }
    }
    else{
        say $d;
    }
}


Name
Phone Number
Email
a,list,of,websites,visited
Date
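
Two more small sketches of splitting on a pattern, both using made-up strings: one tolerates mixed delimiters and stray whitespace, the other uses the limit argument to split only on the first delimiter.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical input with inconsistent delimiters and spacing.
my $messy  = "red , green;blue ,  yellow";
my @colors = split /\s*[,;]\s*/, $messy;
say for @colors;    # red, green, blue, yellow (one per line)

# A limit of 2 stops after the first delimiter, so later ones stay in the value.
my ($key, $value) = split /\s*=\s*/, "path = /usr/bin = odd", 2;
say $key;           # path
say $value;         # /usr/bin = odd
```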



## Putting it All Together
Given an array of tweets at airlines, each element in the format

```Tweet\tUser\tTimestamp```

How might we produce a report of all tweets that reference a specific airport and a specific flight, printing each tweet in a readable format along with its date and any Twitter handles found in it?

In [132]:
use open ':std', ':encoding(UTF-8)';

my $filename = 'airline_tweets.tsv';
open(my $fh, '<:encoding(UTF-8)', $filename)
  or die "Could not open file '$filename' $!";
 
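while (my $row = <$fh>) {
    $data = "";
    chomp $row;
    # Keep rows that contain something shaped like a flight number (capital letters, then digits)
    if ($row =~ /\b[A-Z]+\d+\b/){
        # and something shaped like an airport code (three capital letters)
        if ($row =~ /\b[A-Z][A-Z][A-Z]/){
            # Collect every @-prefixed token in the row
            while ($row =~ /(@.+?\b)/g){
                $data .= " " . $1;
            }
            # The tweet text is the first tab-separated field
            $row =~ /(.*?)\t.*\t.*/;
            $tweet = $1;
            # Reduce the row to its date; this replacement yields MM//DD/YYYY, with a doubled slash after the month
            $row =~ s/^.*(\d\d\d\d)-(\d\d)-(\d\d)\s\d\d:\d\d:\d\d.*$/$2\/\/$3\/$1/;

            say $tweet .' '. $data .' '. $row;
        }
    }
}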

@VirginAmerica amazing to me that we can't get any cold air from the vents. #VX358 #noair #worstflightever #roasted #SFOtoBOS  @VirginAmerica 02//24/2015
@VirginAmerica Love the team running Gate E9 at LAS tonight. Waited for a delayed flight, and they kept things entertaining  @VirginAmerica 02//23/2015
@VirginAmerica completely awesome experience last month BOS-LAS nonstop. Thanks for such an awesome flight and depart time. #VAbeatsJblue  @VirginAmerica 02//22/2015
@VirginAmerica @TTINAC11 I DM you  @VirginAmerica @TTINAC11 02//21/2015
@VirginAmerica  for all my flight stuff wrong and did nothing about it. Had #worst #flight ever  @VirginAmerica 02//21/2015
@VirginAmerica BIG Love/gratitude.mpower w/ http://t.co/1AGR9knCpf weRin #OSCARS2105 VIPswagbags@ #AvalonHollywood http://t.co/ybMbGs0dHn  @VirginAmerica @ # 02//20/2015
@VirginAmerica shares rise on Q4 financial results - USA TODAY http://t.co/lFS4PEFE6y  @VirginAmerica 02//20/2015
@VirginAmerica Debbie Baldwin gave a #rockstar p