# Data Cleaning

## Part 1 Twitter API in python

As I mentioned in data gathering page, I would like to detect the frequency of words to gain a plot. More than this, I plan to use Sentiment analysis to give each tweet a positive or negative attitude.
At this time, I will use countVectorizer to generate a bag of words and count frequency for the tweets I gained. Then I will use nltk package to calculate the sentiment scores for each tweets. 

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn import svm
import nltk
import warnings
warnings.filterwarnings('ignore')
pytt = pd.read_csv("/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/twitterpython.csv")

##### Find the dataset has the NA value or not
There is no NA values in this dataset.

In [2]:
pytt.isnull().sum

<bound method NDFrame._add_numeric_operations.<locals>.sum of      Unnamed: 0     id   lang  author_id  created_at   text
0         False  False  False      False       False  False
1         False  False  False      False       False  False
2         False  False  False      False       False  False
3         False  False  False      False       False  False
4         False  False  False      False       False  False
..          ...    ...    ...        ...         ...    ...
589       False  False  False      False       False  False
590       False  False  False      False       False  False
591       False  False  False      False       False  False
592       False  False  False      False       False  False
593       False  False  False      False       False  False

[594 rows x 6 columns]>

##### Use spacy pipeline as I learned in anly580 to do text normalization and text preprocessing

Since there are a lot of useless puntuations, urls, commas, numbers, highercase, etc will affect our results of frequencies, I decided to remove them all by using pipeline in spacy. 

In [3]:
import re
import spacy
from spacy.language import Language


pipeline = spacy.load('en_core_web_sm')

# http://emailregex.com/
email_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# replace = [ (pattern-to-replace, replacement),  ...]
replace = [
    (r"<a[^>]*>(.*?)</a>", r"\1"),  # Matches most URLs
    (email_re, "email"),            # Matches emails
    (r"(?<=\d),(?=\d)", ""),        # Remove commas in numbers
    (r"\d+", "number"),              # Map digits to special token <numbr>
    (r"[\t\n\r\*\.\@\,\-\/]", " "), # Punctuation and other junk
    (r"\s+", " ")                   # Stips extra whitespace
]

twitter_sentences = []
for i, d in enumerate(pytt['text']):
    for repl in replace:
        d = re.sub(repl[0], repl[1], d)
    twitter_sentences.append(d)


@Language.component("pyttPreprocessor")
def ng20_preprocess(doc):
    tokens = [token for token in doc 
              if not any((token.is_stop, token.is_punct))]
    tokens = [token.lemma_.lower().strip() for token in tokens]
    tokens = [token for token in tokens if token]
    return " ".join(tokens)


pipeline.add_pipe("pyttPreprocessor")

2022-10-07 16:23:40.284262: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


<function __main__.ng20_preprocess(doc)>

##### Pass data through spacy pipeline

In [4]:
docs = []
for sent in twitter_sentences:
    docs.append(pipeline(sent))

##### Generate bag of words and count the frequecies of words

I want to learn about through these keywords, what kind of problems platform users discuss more in their daily lives, so I used the bag of words I generated to count the frequencies of each word appears by using CountVectorizer and set a dataframe which contains texts and counts. Then I sort the table with descending order to see the result more clearly. 

In [5]:
countvectorizer = CountVectorizer()
ttbow = countvectorizer.fit(docs)
features = ttbow.vocabulary_.keys()
counts = ttbow.vocabulary_.values()
ttbow=pd.DataFrame({'words':features,'counts':counts})


In [6]:
ttbow = ttbow.sort_values(by=['counts'],ascending=False)

In [7]:
ttbow

Unnamed: 0,words,counts
674,zznumbernnumbervdscj,733
173,zfpgcrvwjnumber,732
650,zfdnumberwgqcxnumber,731
48,zecops,730
414,york,729
...,...,...
11,acti,4
36,act,3
633,acquisition,2
316,accurate,1


During this dataframe, 


words: name of the count words


counts: frequencies that each word appears

##### Export dataframe to csv file for further use

In [8]:
ttbow.to_csv("/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/wordfreqpython.csv")


#### Then I want to do sentiment analysis for using this dataframe
Sentiment analysis for this part of data cleaning is very important. The sentimentintensityanalyzer can automatically helps me to calculate the score of a text is positive, neutral, negative or compound. I used for loop to calculate text by text and then I get a whole dictionary. Then I trasnferred dictionary to dataframe and split sentiment scores into three columns for the future visualization. Finally, I combine text and sentiment scores together to see the result clearly. 

In [9]:
from nltk.sentiment import SentimentIntensityAnalyzer
def getSentiments(df):
    sid = SentimentIntensityAnalyzer()
    tweet_str = ""
    tweetscore = []
    for tweet in df['text']:
        tweet_str = tweet_str + " " + tweet
        score = sid.polarity_scores(tweet_str)
        tweetscore.append(score)
    return tweetscore


##### Create a dataframe for sentiment analysis as tweets score

In [12]:
sentiment = getSentiments(pytt)

In [7]:
texts = pd.DataFrame(pytt.text)

In [8]:
sentimentscore = pd.DataFrame.from_dict(sentiment)
sentimentscore

Unnamed: 0,neg,neu,pos,compound
0,0.000,1.000,0.000,0.0000
1,0.000,0.959,0.041,0.2960
2,0.000,0.970,0.030,0.2960
3,0.000,0.943,0.057,0.7506
4,0.000,0.889,0.111,0.9370
...,...,...,...,...
589,0.051,0.797,0.152,1.0000
590,0.051,0.797,0.152,1.0000
591,0.051,0.797,0.152,1.0000
592,0.051,0.797,0.152,1.0000


In [9]:
tweetscore = pd.concat([texts,sentimentscore],axis=1)

In [10]:
tweetscore

Unnamed: 0,text,neg,neu,pos,compound
0,RT @relyanceai: Action by California Attorney ...,0.000,1.000,0.000,0.0000
1,Join @IDology and @AiteNovarica on October 12t...,0.000,0.959,0.041,0.2960
2,RT @relyanceai: Action by California Attorney ...,0.000,0.970,0.030,0.2960
3,Action by California Attorney General Shows En...,0.000,0.943,0.057,0.7506
4,RT @deanhager: Welcome @ZecOps to the @JamfSof...,0.000,0.889,0.111,0.9370
...,...,...,...,...,...
589,RT @THORmaximalist: I'm so impressed with @ses...,0.051,0.797,0.152,1.0000
590,RT @nathanbaugh27: Apple surpassed $3.5B in an...,0.051,0.797,0.152,1.0000
591,Data breaches are now part of mainstream repor...,0.051,0.797,0.152,1.0000
592,We are now entering a new era of consumer inte...,0.051,0.797,0.152,1.0000


During this dataframe,


text: each tweet which I collected


neg: negative sentiment score


neutral: neural sentiment score


pos: positive sentiment score


compound: the overall scores

In [12]:
tweetscore.to_csv("/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/pytweetscore.csv")

More than this, I would like to define each tweet is negative, neural, positive with compound score in order to the further explorations of users' attitude. At this time, I plan to first, make a definition code to define one sentence is positive, negative or neutural. Then I will use a for loop to bring each sentence a classification and store it in a dataframe.

In [13]:
def predict_sentiment(sentence):
  '''Function to predict sentiment of a sentence'''
    
  sid = SentimentIntensityAnalyzer()
  sentiment_dict = sid.polarity_scores(sentence)

  # decide sentiment as positive, negative and neutral 
  if sentiment_dict['compound'] >= 0.05 : 
      return ("Positive", round(sentiment_dict['pos']*100, 2))

  elif sentiment_dict['compound'] <= - 0.05 : 
      return ("Negative", round(sentiment_dict['neg']*100, 2))

  else : 
      return ("Neutral", round(sentiment_dict['neu']*100, 2))

In [14]:
pytt.head()

Unnamed: 0.1,Unnamed: 0,id,lang,author_id,created_at,text
0,0,1574462548027609090,en,1523745650432741377,2022-09-26T18:14:58.000Z,RT @relyanceai: Action by California Attorney ...
1,1,1574460821073268738,en,377867202,2022-09-26T18:08:06.000Z,Join @IDology and @AiteNovarica on October 12t...
2,2,1574459562412822528,en,746303248198303744,2022-09-26T18:03:06.000Z,RT @relyanceai: Action by California Attorney ...
3,3,1574457040243916800,en,1227402972990099456,2022-09-26T17:53:05.000Z,Action by California Attorney General Shows En...
4,4,1574453153986056192,en,870794599085879298,2022-09-26T17:37:38.000Z,RT @deanhager: Welcome @ZecOps to the @JamfSof...


Use the for loop to give each sentence a classification whether it is positive, negative or nutural and use append() function to build a dictionary of result

In [56]:
result = [] 
for i in pytt.text:
    result.append(predict_sentiment(i))


Convert dictionary to dataframe and combine this result with text dataframe. At that time, rename the column name in order to make the dataframe easier to understand.

In [52]:
sentimentresult = pd.DataFrame.from_dict(result)
tweetresult = pd.concat([texts,sentimentresult],axis=1)

tweetresult.rename(columns={0:'result'},inplace=True)
tweetresult.rename(columns={1:'scores'},inplace=True)


In [53]:
tweetresult

Unnamed: 0,text,result,scores
0,RT @relyanceai: Action by California Attorney ...,Neutral,100.0
1,Join @IDology and @AiteNovarica on October 12t...,Positive,6.6
2,RT @relyanceai: Action by California Attorney ...,Neutral,100.0
3,Action by California Attorney General Shows En...,Positive,10.7
4,RT @deanhager: Welcome @ZecOps to the @JamfSof...,Positive,34.6
...,...,...,...
589,RT @THORmaximalist: I'm so impressed with @ses...,Positive,40.4
590,RT @nathanbaugh27: Apple surpassed $3.5B in an...,Negative,14.4
591,Data breaches are now part of mainstream repor...,Positive,7.2
592,We are now entering a new era of consumer inte...,Positive,11.2


##### Export the dataframe to csv file

In [55]:
tweetresult.to_csv("/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/pytweetresult.csv")

## Part2 Twitter API in R

For R API, I used twitter to scratch keywords like "instagram, tiktok, youtube, facebook" to scratch users' attitudes about these platforms. 

In [2]:
library(selectr)
library(rvest)
library(xml2)
library(wordcloud2) # for generating really cool looking wordclouds
library(tm) # for text minning
library(dplyr) # loads of fun stuff including piping
library(ROAuth)
library(jsonlite)
library(httpuv)
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
suppressWarnings(expr)#ignore warning
options(warn=-1)

Loading required package: NLP


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: RColorBrewer



In [3]:
TweetsDF <- read.csv("/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/tweetinr.csv")

##### Do text transformation for the dataframe
Take a glimpse of the dataframe, there is a lot of punctuation that is totally not useful for me to count the future frequency. As a result, I used corpus() to normalizing and tokenizing the dataset, then I use tm_map to remove punctuations like "/","@",etc. More than this, because there are a lot of urls, numbers, white space, and highercase in the text. I decided to remove them all because they will afffect my results. So I cleaned the data as below:

In [4]:
FName = "~/Desktop/wemediaexample.txt"
MyFile <- file(FName)
cat(unlist(TweetsDF), " ", file=MyFile, sep="\n")
close(MyFile)

In [5]:
twittertext = Corpus(VectorSource(TweetsDF$text))
toSpace = content_transformer(
              function (x, pattern)
              gsub(pattern, " ", x))
twittertext1 = tm_map(twittertext, toSpace, "/")
twittertext1 = tm_map(twittertext, toSpace, "@")
twittertext1 = tm_map(twittertext, toSpace, "#")
twittertext1 = tm_map(twittertext1, content_transformer(tolower))
twittertext1 = tm_map(twittertext1, removeNumbers)
twittertext1 = tm_map(twittertext1, stripWhitespace)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
twittertext1 = tm_map(twittertext, removeURL)
strwrap(twittertext1)


<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 250

##### Calculate the frequencies of words
Since I need the results to compare the popularity among each platforms, the frequencies of words will help me to understand which platforms are discussed more among twitter. As a result, I used termdocumentmatrix() to calculate the frequecies each word appears. Then I sort the table in the descending order to see the frequencies more straightforward.


In [5]:
twittertm = TermDocumentMatrix(twittertext1)
m = as.matrix(twittertm)
v = sort(rowSums(m), 
         decreasing = TRUE)
d = data.frame(word = names(v),
               freq = v)
head(d, 10)

Unnamed: 0_level_0,word,freq
Unnamed: 0_level_1,<chr>,<dbl>
tik,tik,166
tok,tok,162
and,and,140
brand,brand,90
she,she,90
wearing,wearing,90
posted,posted,88
viral,viral,88
://t.co/xiuittpwgv,://t.co/xiuittpwgv,87
@skyliej_:,@skyliej_:,87


During this dataframe, 


word: name of the count words


freq: frequencies that each word appears

In [6]:
write.csv(d,"/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/twittertm.csv")

## Modified data

Instagram is one of the most popular and famous platforms for users to network, post their daily lives. There are a lot of famous internet influencers here. Instagram is also one of the important parts for wemedia to develop. Users can create channels to post their daily lives, show their abilities, post some fun videos in order to attract more fans. Ad sellers also analyze their business value to contact instagram influencers to promote their products in order to gain more benefits. 

In this dataset, it clearly shows followers'number, average likes each post, country etc. I would like to use this dataset to deeper analyze the trend of internet influencers followers. Country distribution of these influencers. More than this, I would like to use other datasets to locate the fields which these inernet influencers focus to further analyze factors of these internet influencers. What makes them success. 

channel info: Username in instagram

influence score: It is calculated based on their popularity.

posts: total posts they have

followers: total followers they have

avg_likes: average likes of their total posts

60_days_eng_rate: 60 days of engagement rate

new_post_avg_like: a calculation of the average likes they gained from new posts. 

total_likes: total likes of their posts in instagram.

country: users' origin. What countries they from?

In [12]:
library(tidyverse)
library(dplyr)
library(reshape2)
library(tidyr)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.2 ──
[32m✔[39m [34mggplot2[39m 3.3.6      [32m✔[39m [34mpurrr  [39m 0.3.4 
[32m✔[39m [34mtibble [39m 3.1.8      [32m✔[39m [34mdplyr  [39m 1.0.10
[32m✔[39m [34mtidyr  [39m 1.2.0      [32m✔[39m [34mforcats[39m 0.5.2 
[32m✔[39m [34mreadr  [39m 2.1.2      
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()

Attaching package: ‘reshape2’


The following object is masked from ‘package:tidyr’:

    smiths




In [26]:
instagram_infl = read.csv("/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/top_insta_influencers_data.csv")

In [27]:
head(instagram_infl)

Unnamed: 0_level_0,rank,channel_info,influence_score,posts,followers,avg_likes,X60_day_eng_rate,new_post_avg_like,total_likes,country
Unnamed: 0_level_1,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,cristiano,92,3.3k,475.8m,8.7m,1.39%,6.5m,29.0b,Spain
2,2,kyliejenner,91,6.9k,366.2m,8.3m,1.62%,5.9m,57.4b,United States
3,3,leomessi,90,0.89k,357.3m,6.8m,1.24%,4.4m,6.0b,
4,4,selenagomez,93,1.8k,342.7m,6.2m,0.97%,3.3m,11.5b,United States
5,5,therock,91,6.8k,334.1m,1.9m,0.20%,665.3k,12.5b,United States
6,6,kimkardashian,91,5.6k,329.2m,3.5m,0.88%,2.9m,19.9b,United States


At first, I notified there are some NA values in the column country, Because I need to use these columns, so I don't want to drop these values. I decided to change NA values to undefined to explain that these influencers have not defined country. 

In [28]:
instagram_infl$country[instagram_infl$country==""] <- "Undefined"

In [29]:
head(instagram_infl)

Unnamed: 0_level_0,rank,channel_info,influence_score,posts,followers,avg_likes,X60_day_eng_rate,new_post_avg_like,total_likes,country
Unnamed: 0_level_1,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,cristiano,92,3.3k,475.8m,8.7m,1.39%,6.5m,29.0b,Spain
2,2,kyliejenner,91,6.9k,366.2m,8.3m,1.62%,5.9m,57.4b,United States
3,3,leomessi,90,0.89k,357.3m,6.8m,1.24%,4.4m,6.0b,Undefined
4,4,selenagomez,93,1.8k,342.7m,6.2m,0.97%,3.3m,11.5b,United States
5,5,therock,91,6.8k,334.1m,1.9m,0.20%,665.3k,12.5b,United States
6,6,kimkardashian,91,5.6k,329.2m,3.5m,0.88%,2.9m,19.9b,United States


As we noticed, the avg_likes column contains different unit, I want to expand it as the numeric instead of character in order to do the further visualization and comparison. Here is my code:

In [30]:
library(stringr)
instagram_infl$avg_likes1 <- str_extract(instagram_infl$avg_likes, "\\d+\\.?\\d*") #extract number without units
instagram_infl$avg_likes1unit <- str_sub(instagram_infl$avg_likes,-1) #extract units since units are last words
instagram_infl$avg_likes1 <- as.numeric(instagram_infl$avg_likes1) #change the new column datatype as num
instagram_infl$avg_likes1 <- ifelse(instagram_infl$avg_likes1unit == 'm', instagram_infl$avg_likes1*1000000, instagram_infl$avg_likes1*1000)
#Since the unit is different, I used the ifelse function to do the further calculation
instagram_infl <- select(instagram_infl,-c(avg_likes,avg_likes1unit))#Drop useless columns
names(instagram_infl)[names(instagram_infl)=='avg_likes1'] <- 'avg_likes'#rename columns
head(instagram_infl)

Unnamed: 0_level_0,rank,channel_info,influence_score,posts,followers,X60_day_eng_rate,new_post_avg_like,total_likes,country,avg_likes
Unnamed: 0_level_1,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,1,cristiano,92,3.3k,475.8m,1.39%,6.5m,29.0b,Spain,8700000
2,2,kyliejenner,91,6.9k,366.2m,1.62%,5.9m,57.4b,United States,8300000
3,3,leomessi,90,0.89k,357.3m,1.24%,4.4m,6.0b,Undefined,6800000
4,4,selenagomez,93,1.8k,342.7m,0.97%,3.3m,11.5b,United States,6200000
5,5,therock,91,6.8k,334.1m,0.20%,665.3k,12.5b,United States,1900000
6,6,kimkardashian,91,5.6k,329.2m,0.88%,2.9m,19.9b,United States,3500000


At that time, I used summary and want to know the median, mean, etc values of each numeric number, then I noticed that other columns like posts also cannot convert to numeric since the cells in the column have units. As a result, I decided to get rid of units in each columns and change the column type into numeric

In [31]:
print(summary(instagram_infl))

      rank        channel_info       influence_score    posts          
 Min.   :  1.00   Length:200         Min.   :22.00   Length:200        
 1st Qu.: 50.75   Class :character   1st Qu.:80.00   Class :character  
 Median :100.50   Mode  :character   Median :84.00   Mode  :character  
 Mean   :100.50                      Mean   :81.82                     
 3rd Qu.:150.25                      3rd Qu.:86.00                     
 Max.   :200.00                      Max.   :93.00                     
  followers         X60_day_eng_rate   new_post_avg_like  total_likes       
 Length:200         Length:200         Length:200         Length:200        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                  

In [32]:
instagram_infl$posts = substr(instagram_infl$posts,1,nchar(instagram_infl$posts)-1)
instagram_infl$posts = as.numeric(instagram_infl$posts)
names(instagram_infl)[names(instagram_infl)=='posts'] <- 'posts(k)'
instagram_infl$followers = substr(instagram_infl$followers,1,nchar(instagram_infl$followers)-1)
instagram_infl$followers = as.numeric(instagram_infl$followers)
names(instagram_infl)[names(instagram_infl)=='followers'] <- 'followers(m)'
instagram_infl$X60_day_eng_rate = substr(instagram_infl$X60_day_eng_rate,1,nchar(instagram_infl$X60_day_eng_rate)-1)
instagram_infl$X60_day_eng_rate = as.numeric(instagram_infl$X60_day_eng_rate)
names(instagram_infl)[names(instagram_infl)=='X60_day_eng_rate'] <- 'X60_day_eng_rate(%)'
instagram_infl$new_post_avg_like = substr(instagram_infl$new_post_avg_like,1,nchar(instagram_infl$new_post_avg_like)-1)
instagram_infl$new_post_avg_like = as.numeric(instagram_infl$new_post_avg_like)
names(instagram_infl)[names(instagram_infl)=='new_post_avg_like'] <- 'new_post_avg_like(m)'
instagram_infl$total_likes = substr(instagram_infl$total_likes,1,nchar(instagram_infl$total_likes)-1)
instagram_infl$total_likes = as.numeric(instagram_infl$total_likes)
names(instagram_infl)[names(instagram_infl)=='total_likes'] <- 'total_likes(b)'
head(instagram_infl)

Unnamed: 0_level_0,rank,channel_info,influence_score,posts(k),followers(m),X60_day_eng_rate(%),new_post_avg_like(m),total_likes(b),country,avg_likes
Unnamed: 0_level_1,<int>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>
1,1,cristiano,92,3.3,475.8,1.39,6.5,29.0,Spain,8700000
2,2,kyliejenner,91,6.9,366.2,1.62,5.9,57.4,United States,8300000
3,3,leomessi,90,0.89,357.3,1.24,4.4,6.0,Undefined,6800000
4,4,selenagomez,93,1.8,342.7,0.97,3.3,11.5,United States,6200000
5,5,therock,91,6.8,334.1,0.2,665.3,12.5,United States,1900000
6,6,kimkardashian,91,5.6,329.2,0.88,2.9,19.9,United States,3500000


In [33]:
summary(instagram_infl)

      rank        channel_info       influence_score    posts(k)      
 Min.   :  1.00   Length:200         Min.   :22.00   Min.   : 0.0100  
 1st Qu.: 50.75   Class :character   1st Qu.:80.00   1st Qu.: 0.9475  
 Median :100.50   Mode  :character   Median :84.00   Median : 2.1000  
 Mean   :100.50                      Mean   :81.82   Mean   : 3.4998  
 3rd Qu.:150.25                      3rd Qu.:86.00   3rd Qu.: 5.0250  
 Max.   :200.00                      Max.   :93.00   Max.   :17.5000  
                                                                      
  followers(m)    X60_day_eng_rate(%) new_post_avg_like(m) total_likes(b)  
 Min.   : 32.80   Min.   : 0.010      Min.   :  1.0        Min.   :  1.00  
 1st Qu.: 40.00   1st Qu.: 0.410      1st Qu.:  4.4        1st Qu.:  2.00  
 Median : 50.05   Median : 0.880      Median :149.3        Median :  4.00  
 Mean   : 77.41   Mean   : 1.902      Mean   :247.0        Mean   :142.13  
 3rd Qu.: 68.90   3rd Qu.: 2.035      3rd Qu.:412.3 

In [34]:
head(instagram_infl)

Unnamed: 0_level_0,rank,channel_info,influence_score,posts(k),followers(m),X60_day_eng_rate(%),new_post_avg_like(m),total_likes(b),country,avg_likes
Unnamed: 0_level_1,<int>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>
1,1,cristiano,92,3.3,475.8,1.39,6.5,29.0,Spain,8700000
2,2,kyliejenner,91,6.9,366.2,1.62,5.9,57.4,United States,8300000
3,3,leomessi,90,0.89,357.3,1.24,4.4,6.0,Undefined,6800000
4,4,selenagomez,93,1.8,342.7,0.97,3.3,11.5,United States,6200000
5,5,therock,91,6.8,334.1,0.2,665.3,12.5,United States,1900000
6,6,kimkardashian,91,5.6,329.2,0.88,2.9,19.9,United States,3500000


Now the dataset is much clean now and it is easy for further calculation. Now we can export dataframe as csv files now.

In [35]:
write.csv(instagram_infl,"/Users/yangyilin/Desktop/anly-501-project-YilinYang2000-1/data/00-raw-data/instagram_infl.csv")