## Environment Setting

In [106]:
import pandas as pd
import numpy as np

In [107]:
import multiprocessing

num_processors = multiprocessing.cpu_count()
num_processors

20

In [108]:
from pandarallel import pandarallel

# Initialize pandarallel
pandarallel.initialize(nb_workers = multiprocessing.cpu_count()-1, use_memory_fs=False, progress_bar=False)

INFO: Pandarallel will run on 19 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

https://nalepae.github.io/pandarallel/troubleshooting/


In [109]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## Data Loading

In [110]:
%%time
df_raw = pd.read_parquet('news_final_project.parquet', engine='pyarrow')

CPU times: total: 4.53 s
Wall time: 7.8 s


In [111]:
df_raw.head(1)

Unnamed: 0,url,date,language,title,text
0,http://en.people.cn/n3/2021/0318/c90000-9830122.html,2021-03-18,en,Artificial intelligence improves parking efficiency in Chinese cities - People's Daily Online,"\n\nArtificial intelligence improves parking efficiency in Chinese cities - People's Daily Online\n\nHome\nChina Politics\nForeign Affairs\nOpinions\nVideo: We Are China\nBusiness\nMilitary\nWorld\nSociety\nCulture\nTravel\nScience\nSports\nPhoto\n\nLanguages\n\nChinese\nJapanese\nFrench\nSpanish\nRussian\nArabic\nKorean\nGerman\nPortuguese\nThursday, March 18, 2021\nHome>>\n\t\t\nArtificial intelligence improves parking efficiency in Chinese cities\nBy Liu Shiyao (People's Daily) 09:16, March 18, 2021\nPhoto taken on July 1, 2019, shows a sign for electronic toll collection (ETC) newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. (People’s Daily Online/Li Wenming)\n\n\tThanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion.\n\n\tAs the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent.\n\n\tWith the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away.\n\n\t“This road used to be full of cars, and even the normal lanes were occupied. You could hardly move a bit during the morning and evening commute time,” recalled a citizen surnamed Wang, who lives in Chaoyang district of Beijing.\n\n\t“Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said.\n\n\tThe smart roadside ETC system “AIpark Sky Eye” adopted by Beijing is developed, operated, and maintained by AIpark, a Beijing-based leading smart parking solution provider.\n\n\tThe company’s intelligent system has brought into full play the advantages of AI technologies and effectively addressed the shortage of parking spaces and the problem of irregular parking in cities. The system has therefore been listed among the country’s innovation projects that integrate AI deeply into the real economy in 2018 by China’s Ministry of Industry and Information Technology (MIIT).\n\n\tTraditional parking management equipment and monitoring devices have failed to meet the actual needs of cities due to limited application scenarios and technical capacity. There are many deficiencies in traditional parking systems. For example, magnetic devices cannot identify detailed information about vehicles; each video monitoring pile can only cover one parking spot; and manual collection of parking fees costs too much.\n\n\tSuch problems don’t exist in smart machines. The “AIpark Sky Eye” system boasts strong stability and high recognition rate. Besides, it can resist the interference of extreme weather conditions like rain, snow, and fog, and form complete graphic evidence based on wheel path of vehicles.\n\n\tEach set of cameras of the “AIpark Sky Eye” system can monitor multiple parking spots at the same time for 24 hours a day. The data collected by front-end cameras are processed using multi-dimensional deep learning algorithm before they are uploaded on to an AI computing cloud platform for data enrichment. The platform then distributes identification results to transport authorities.\n\n\tThe most distinctive innovation in the technological package of the system is precision brought about by high-mounted parking system cameras, according to Xiang Yanping, senior vice president of AIpark, noting that the cameras can recognize more complex static and dynamic reality scenes.\n\n\t“For example, the equipment can accurately identify irregular parking behaviors and state such as double parking and frequent maneuvers, precisely recognize detailed information including plate number and vehicle color, and make good judgment on the behaviors of drivers and pedestrians,” Xiang said.\n\n\tOnce the high-mounted parking system cameras are installed, they can help with many aspects of integrated urban governance, which represents another advantage of the “AIpark Sky Eye” system.\n\n\tBesides managing parking fee collection, high-mounted camera system can also provide data for traffic improvements. The snapshots obtained from the camera system can help solve problems including illegal and inappropriate parking and vehicle theft.\n\n\tSo far, the smart ETC system of AIpark has been introduced into more than 20 cities in China, signaling increasingly important roles of AI in improving parking efficiency and order as well as new development opportunities for smart parking industry.\n【1】【2】【3】\nPhotos\nNaval fleet steams in East China Sea\nNewborn golden snub-nosed monkey makes debut\nIn pics: birds across China\nGrain painting studio helps villagers to increase income\nRelated Stories\nExhibition highlighting art-science integration opens in BeijingChina’s AI industry poised to enter boom timesAlibaba outlines 10 technology trends for 2019China’s leading AI enterprise iFlytek to develop health information technologyCheetah Mobile wades into artificial intelligenceChina overtakes the US in investment in AI5 million artificial intelligence talents urgently needed in ChinaGraduate students give 'voice' to sign languageArtificial Intelligence in real livesMicrosoft embraces artificial intelligence\nAbout People's Daily Online | Join Us | Contact Us\nCopyright © 2021 People's Daily Online. All Rights Reserved.\n\t\n"


In [112]:
df_raw.shape

(200332, 5)

## Data Cleaning

In [113]:
import re

In [114]:
# filter english article
df_raw = df_raw[df_raw.language == 'en']
df_raw.shape

(200332, 5)

#### URL cleaning

In [116]:
df_raw.url[:5]

0                                                                                                                                                                                                                                                                                    http://en.people.cn/n3/2021/0318/c90000-9830122.html
1                                                                                                                                                                                                http://newsparliament.com/2020/02/27/children-with-autism-saw-their-learning-and-social-skills-boosted-after-playing-with-this-ai-robot/
2                                                                                                                                                                                                                                                                                                        http://www.dataweek.co.za/12835r
3         

In [118]:
def extract_domain(url):
    from urllib.parse import urlparse
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    return domain

In [119]:
df_raw['domain'] = df_raw['url'].parallel_apply(extract_domain)

In [120]:
df_raw['domain'].head()

0                                 en.people.cn
1                           newsparliament.com
2                           www.dataweek.co.za
3    www.homeoffice.consumerelectronicsnet.com
4                        www.itbusinessnet.com
Name: domain, dtype: object

#### Title Cleaning

In [121]:
df_raw.title.head(10)

0                                                                                                                                         Artificial intelligence improves parking efficiency in Chinese cities - People's Daily Online
1                                                                                                                  Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot – News Parliament
2                                                                                                            Forget ML, AI and Industry 4.0 – obsolescence should be your focus - 26 February 2021 - Test & Rework Solutions - Dataweek
3                                                                                                                            Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered – Consumer Electronics Net
4                                                                       

In [122]:
def clean_title(title):
    import re

    cleaned_title = re.split(r'[–|]', title)[0].strip()

    return cleaned_title

In [123]:
df_raw['title'] = df_raw['title'].str.replace('-', '–')

In [124]:
df_raw['clean_title'] = df_raw['title'].apply(clean_title)

In [125]:
df_raw.clean_title.head(10)

0                                                                    Artificial intelligence improves parking efficiency in Chinese cities
1                                       Children With Autism Saw Their Learning and Social Skills Boosted After Playing With This AI Robot
2                                                                                                           Forget ML, AI and Industry 4.0
3                                                          Strategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered
4                Olympus to Support Endoscopic AI Diagnosis Education for Doctors in India and to Launch AI Diagnostic Support Application
5                                                                      Cr Bard Inc Has Returned 48.9% Since SmarTrend Recommendation (BCR)
6                                                       From the Bard to broadcaster: Stratford Festival builds new identity with streamer
7                          

#### Text Cleaning

In [126]:
from IPython.display import Markdown as md

In [127]:
df_raw.text[3]

'\n\nStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered – Consumer Electronics Net\n \nSkip to content\n\nConsumer Electronics Net\n\nPrimary Menu\n\nConsumer Electronics Net\n\nSearch for:\n \nHomeNewsStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered \n \n                                 News\n                             \n \nStrategy Analytics: 71% of Smartphones Sold Globally in 2021 will be AI Powered\n                    7 hours ago            \n \n\nArtificial Intelligence Now Powers the Majority of Smartphones\n\nBOSTON–(BUSINESS WIRE)–Strategy Analytics in a newly published report, Smartphones: Global Artificial Intelligence Technologies Forecast to 2025, finds that on-device Artificial Intelligence (AI) is being rapidly implemented by smartphone vendors. AI is used in various functions inside smartphones such as intelligent power optimization, imaging, virtual assistants, and to enhance device performance. The report 

Found that the first part of text are always title + some description of the article website such as language settings, author name, etc. After the second occurence of the title, there are the true text of the article.

Therefore, we could use this characteristic to exclude the first useless part of the text.

In [128]:
equal_irrelevant = [


    'comments section',
    'privacy policy',
    'Contact us',
    'Join us',
    'Photos',
    'Home',
    'No results found',
    'Report a problem',
    'Related articles',
 
]

# previously I removed all the elements (may be paragraphs of the main text seprarated by a new line) contains those words
# and therefore may remove so some articles that cotain keywords in those removed paragraphs
# So finally there are only 118k articles and after fixing this problem, there are 168k


In [129]:
# include irrelevant words --> remove
include_irrelevant = [
    'http',
    'www',
    '.com',
    'subscribe to our', 
    'Photo taken',
    'press release',
    'Connect with us',
    'Follow us on instagram',
    'Follow us on twitter', 
    'Like us on facebook',
    'Email us at',
    'Copyright @',
    'Copyright at',
    '.org',
    'You may have missed',
    'Leave a reply',
    'Related posts',
    'photo credit',
    'terms of service',
    'all rights reserved',
    'click here to',
    'follow us on',
    'sign up for',
    'editor’s note',
    'Share this article',
    'Image source',
    'Join our mailing list',
    'Sponsored content',


    # the below is second cleaning after NER
    'The Associated Press',
    'When typing in this field a list of search results',
    'Gray Media Group',
    'Digi Communications'
   
]

In [130]:

recommended_phrases = [
    "you may have missed",
    "read more",
    "related articles",
    "recommended for you",
    "also read",
    "explore more",
    "don't miss",
    'do not miss',
    "similar stories",
    "trending now",
    "more stories",
    "related Stories",
    'more information and articles',
    'sponsored content',
    'Editor’s pick',
    'Matched content'

]



##### Test

In [58]:
test = df_raw.text[0]

In [20]:
elements = re.split('\r|\n|\t|\/|\|', test)
elements

['',
 '',
 "Artificial intelligence improves parking efficiency in Chinese cities - People's Daily Online",
 '',
 'Home',
 'China Politics',
 'Foreign Affairs',
 'Opinions',
 'Video: We Are China',
 'Business',
 'Military',
 'World',
 'Society',
 'Culture',
 'Travel',
 'Science',
 'Sports',
 'Photo',
 '',
 'Languages',
 '',
 'Chinese',
 'Japanese',
 'French',
 'Spanish',
 'Russian',
 'Arabic',
 'Korean',
 'German',
 'Portuguese',
 'Thursday, March 18, 2021',
 'Home>>',
 '',
 '',
 '',
 'Artificial intelligence improves parking efficiency in Chinese cities',
 "By Liu\xa0Shiyao (People's Daily) 09:16, March 18, 2021",
 'Photo taken on July 1, 2019, shows a sign for electronic toll collection (ETC) newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. (People’s Daily Online',
 'Li Wenming)',
 '',
 '',
 'Thanks to the application of an artificial intel

In [21]:
len(elements)

99

In [22]:
elements = [el for el in elements if len(el) > 0]


# now the list contains main text, some long sentence of irrelanvent instructions, "read others"
# remove elements that contains anything in irrelanvent
elements = [el for el in elements if not any(irrelevant_word in el for irrelevant_word in irrelevant)]

In [23]:
elements

["Artificial intelligence improves parking efficiency in Chinese cities - People's Daily Online",
 'China Politics',
 'Foreign Affairs',
 'Opinions',
 'Video: We Are China',
 'Business',
 'Military',
 'World',
 'Society',
 'Culture',
 'Travel',
 'Science',
 'Sports',
 'Photo',
 'Languages',
 'Chinese',
 'Japanese',
 'French',
 'Spanish',
 'Russian',
 'Arabic',
 'Korean',
 'German',
 'Portuguese',
 'Thursday, March 18, 2021',
 'Artificial intelligence improves parking efficiency in Chinese cities',
 "By Liu\xa0Shiyao (People's Daily) 09:16, March 18, 2021",
 'Li Wenming)',
 'Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion.',
 'As the city further deepens its roadside parking reform, the ETC system has almost covered al

In [24]:
len(elements)

46

In [25]:
elements = [el for el in elements if len(el.split()) > 10 and len(el) > 50]

In [26]:
elements

["Artificial intelligence improves parking efficiency in Chinese cities - People's Daily Online",
 'Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion.',
 'As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent.',
 'With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away.',
 '“Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now s

In [27]:
len(elements)

12

In [28]:
elements = [el for el in elements if df_raw.clean_title[0] not in el]

In [29]:
elements

['Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion.',
 'As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent.',
 'With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away.',
 '“Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said.',
 'The smart roadside ETC system “AIpark Sky Eye” adopted by

In [30]:
# Join the elements back together
filtered_text = ' '.join(elements)
filtered_text

"Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion. As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent. With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away. “Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said. The smart roadside ETC system “AIpark Sky Eye” adopted by Beijing is devel

In [31]:
filtered_text = re.sub(r'[\[|\(][^\]|\)]*?(?=http)', '', filtered_text)
filtered_text

"Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion. As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent. With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away. “Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said. The smart roadside ETC system “AIpark Sky Eye” adopted by Beijing is devel

In [32]:
filtered_text = re.sub(r'<a.*?>', '', filtered_text)
filtered_text

"Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion. As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent. With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away. “Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said. The smart roadside ETC system “AIpark Sky Eye” adopted by Beijing is devel

In [33]:
len(filtered_text)

3035

In [34]:
filtered_text = re.sub(r'\bpic.twitter.com/\S+', '', filtered_text)
filtered_text
len(filtered_text)

3035

In [35]:
filtered_text = re.sub(r'\bpic.twitter.com/\S+', '', filtered_text)
len(filtered_text)

3035

In [36]:
filtered_text = re.sub(r'#[\S]+\b', '', filtered_text)
len(filtered_text)

3035

In [37]:
filtered_text = re.sub(r'\b@\S+\b', '', filtered_text)
len(filtered_text)

3035

In [38]:
filtered_text = re.sub(r'[\w\.-]+@[\w\.-]+\.\w+', '', filtered_text)
len(filtered_text)

3035

In [39]:
filtered_text = re.sub(r'http\S+', '', filtered_text)
len(filtered_text)

3035

In [40]:
filtered_text = re.sub(r'[^\w\s]', '', filtered_text)
filtered_text

'Thanks to the application of an artificial intelligence AIempowered roadside electronic toll collection ETC system Chinas capital city Beijing has seen significant improvement in the efficiency of parking fee collection turnover of roadside parking spots order in roadside parking as well as traffic congestion As the city further deepens its roadside parking reform the ETC system has almost covered all the roadside parking spaces in the city with the proportion of vehicles parked on roads using the system exceeding 90 percent With the AIempowered system drivers can park their vehicles at the parking spots on the roadside and then pay the parking charge via their mobile phones after they drive away Since the summer of 2019 roadside ETC devices have been installed here With all the cars being parked in designated parking spots on the roadside the road now seems brighter and wider Wang said The smart roadside ETC system AIpark Sky Eye adopted by Beijing is developed operated and maintaine

In [41]:
filtered_text = re.sub(r'\s+', ' ', filtered_text)
len(filtered_text)

2972

##### Formal Cleaning

In [131]:
def clean_news(text, clean_title):
    # import re


    # First, split the text by new lines, tab, |, -
    elements = re.split('\r|\n|\t|\/|\|', text)

    # remove empty elements --> solve \n\n situations
    elements = [el for el in elements if len(el) > 0]


   # remove the article recommendations
    for i, el in enumerate(elements):
        if el.lower() in recommended_phrases:
            elements = elements[:i]
            break

    # now the list contains main text, some long sentence of irrelanvent instructions, "read others"
    # remove elements that contains anything in irrelevant
    elements = [el for el in elements if not any(irrelevant_word.lower() in el.lower() for irrelevant_word in include_irrelevant)]


    # remove elements that equal to annything in irrelevant --> actually unnecessary since the length  control will work for it
    # elements = [el for el in elements if not any(el.lower() == irrelevant_word.lower() for irrelevant_word in equal_irrelevant)]


    # now filter out elements that are likely not part of the main text --> but some main text consists of paragraphs separated by \n
    # so we cannot set a threshold for main text. Instead we should set a threshold for each paragraph
    # Also, the length of paragraph for news is usually much shorter
    elements = [el for el in elements if len(el.split()) > 10 and len(el) > 50]

    # escape the elemet contaning the title
    elements = [el for el in elements if clean_title not in el]

    # Join the elements back together
    filtered_text = ' '.join(elements)


    # Then, proceed with the some basic clean-up steps: some should be removed but I just kept them in case

    # remove remnants of web crawls
    filtered_text = re.sub(r'[\[|\(][^\]|\)]*?(?=http)', '', filtered_text)
    filtered_text = re.sub(r'<a.*?>', '', filtered_text)
    filtered_text = re.sub(r'\bpic.twitter.com/\S+', '', filtered_text)
    filtered_text = re.sub(r'#[\S]+\b', '', filtered_text)
    filtered_text = re.sub(r'\b@\S+\b', '', filtered_text)
    filtered_text = re.sub(r'[\w\.-]+@[\w\.-]+\.\w+', '', filtered_text)
    # remove URLs
    filtered_text = re.sub(r'http\S+', '', filtered_text)
    # no need to remove punctuation and special characters
    # filtered_text = re.sub(r'[^\w\s]', '', filtered_text)
    # remove extra white spaces
    filtered_text = re.sub(r'\s+', ' ', filtered_text)
    
    return filtered_text


In [132]:
%%time
df_raw['clean_text'] = df_raw.apply(lambda x: clean_news(x['text'], x['clean_title']), axis=1)

CPU times: total: 2min 54s
Wall time: 5min 23s


In [70]:
# df_raw['clean_text'] = df_raw.parallel_apply(lambda x: clean_news(x['text'], x['clean_title']), axis=1)

In [133]:
df_raw.clean_text[0]

"Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion. As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent. With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away. “This road used to be full of cars, and even the normal lanes were occupied. You could hardly move a bit during the morning and evening commute time,” recalled a citizen surnamed Wang, who lives in Chaoyang district of Beijing. “Since the summer of 2019, roadside ETC devices

#### Check after First NER

In [None]:
# check whether digital communications is a news source
digi_rows = df_raw[df_raw['clean_text'].str.contains('Digi Communications', case=False)]
digi_rows.shape

(183, 7)

In [None]:
digi_rows.url.head()

80      https://express-press-release.net/news/2022/11/07/1287823
2098    https://express-press-release.net/news/2023/04/14/1401010
5152    https://express-press-release.net/news/2022/04/06/1138245
6194    https://express-press-release.net/news/2022/04/14/1144652
9240    https://express-press-release.net/news/2021/09/25/1023059
Name: url, dtype: object

In [59]:
gray_media_rows = df_raw[df_raw['clean_text'].str.contains('Gray Media Group Inc', case=False)]
gray_media_rows.shape

(26956, 7)

In [60]:
gray_media_rows.url.head()

280                                                     https://www.1011now.com/prnewswire/2022/11/03/kbr-awarded-analytic-modeling-data-science-contracts-worth-over-120m/
286                                  https://www.actionnews5.com/prnewswire/2022/07/29/quantum-music-showcase-groundbreaking-ai-baby-crying-translator-q-bear-tta-pavilion/
287                          https://www.actionnews5.com/prnewswire/2023/03/21/manufacturers-focused-digital-transformation-through-ai-recognized-by-symphonyai-industrial/
288    https://www.alaskasnewssource.com/prnewswire/2022/08/02/workwave-launches-servicebot-by-workwave-delivering-powerful-ai-sales-technology-variety-service-industries/
289                                               https://www.alaskasnewssource.com/prnewswire/2023/04/04/homedigy-launches-geodrops-worlds-smartest-ai-irrigation-manager/
Name: url, dtype: object

## Filter Relevant News

In [134]:
# filter relevant text
keywords = ['artificial intelligence', 'data science', 'machine Learning', 'AI']

df_filtered = df_raw[df_raw['clean_text'].str.contains('|'.join(keywords), case=False)]

In [135]:
df_filtered = df_filtered.reset_index(drop=True)

In [136]:
df_filtered = df_filtered.drop(['url', 'language'], axis=1)


In [137]:
df_filtered.head(1)

Unnamed: 0,date,title,text,domain,clean_title,clean_text
0,2021-03-18,Artificial intelligence improves parking efficiency in Chinese cities – People's Daily Online,"\n\nArtificial intelligence improves parking efficiency in Chinese cities - People's Daily Online\n\nHome\nChina Politics\nForeign Affairs\nOpinions\nVideo: We Are China\nBusiness\nMilitary\nWorld\nSociety\nCulture\nTravel\nScience\nSports\nPhoto\n\nLanguages\n\nChinese\nJapanese\nFrench\nSpanish\nRussian\nArabic\nKorean\nGerman\nPortuguese\nThursday, March 18, 2021\nHome>>\n\t\t\nArtificial intelligence improves parking efficiency in Chinese cities\nBy Liu Shiyao (People's Daily) 09:16, March 18, 2021\nPhoto taken on July 1, 2019, shows a sign for electronic toll collection (ETC) newly set up at a roadside parking space on Yangzhuang road, Shijingshan district, Beijing. Some urban areas of the city started to use ETC system for roadside parking spaces since July 1, 2019. (People’s Daily Online/Li Wenming)\n\n\tThanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion.\n\n\tAs the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent.\n\n\tWith the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away.\n\n\t“This road used to be full of cars, and even the normal lanes were occupied. You could hardly move a bit during the morning and evening commute time,” recalled a citizen surnamed Wang, who lives in Chaoyang district of Beijing.\n\n\t“Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said.\n\n\tThe smart roadside ETC system “AIpark Sky Eye” adopted by Beijing is developed, operated, and maintained by AIpark, a Beijing-based leading smart parking solution provider.\n\n\tThe company’s intelligent system has brought into full play the advantages of AI technologies and effectively addressed the shortage of parking spaces and the problem of irregular parking in cities. The system has therefore been listed among the country’s innovation projects that integrate AI deeply into the real economy in 2018 by China’s Ministry of Industry and Information Technology (MIIT).\n\n\tTraditional parking management equipment and monitoring devices have failed to meet the actual needs of cities due to limited application scenarios and technical capacity. There are many deficiencies in traditional parking systems. For example, magnetic devices cannot identify detailed information about vehicles; each video monitoring pile can only cover one parking spot; and manual collection of parking fees costs too much.\n\n\tSuch problems don’t exist in smart machines. The “AIpark Sky Eye” system boasts strong stability and high recognition rate. Besides, it can resist the interference of extreme weather conditions like rain, snow, and fog, and form complete graphic evidence based on wheel path of vehicles.\n\n\tEach set of cameras of the “AIpark Sky Eye” system can monitor multiple parking spots at the same time for 24 hours a day. The data collected by front-end cameras are processed using multi-dimensional deep learning algorithm before they are uploaded on to an AI computing cloud platform for data enrichment. The platform then distributes identification results to transport authorities.\n\n\tThe most distinctive innovation in the technological package of the system is precision brought about by high-mounted parking system cameras, according to Xiang Yanping, senior vice president of AIpark, noting that the cameras can recognize more complex static and dynamic reality scenes.\n\n\t“For example, the equipment can accurately identify irregular parking behaviors and state such as double parking and frequent maneuvers, precisely recognize detailed information including plate number and vehicle color, and make good judgment on the behaviors of drivers and pedestrians,” Xiang said.\n\n\tOnce the high-mounted parking system cameras are installed, they can help with many aspects of integrated urban governance, which represents another advantage of the “AIpark Sky Eye” system.\n\n\tBesides managing parking fee collection, high-mounted camera system can also provide data for traffic improvements. The snapshots obtained from the camera system can help solve problems including illegal and inappropriate parking and vehicle theft.\n\n\tSo far, the smart ETC system of AIpark has been introduced into more than 20 cities in China, signaling increasingly important roles of AI in improving parking efficiency and order as well as new development opportunities for smart parking industry.\n【1】【2】【3】\nPhotos\nNaval fleet steams in East China Sea\nNewborn golden snub-nosed monkey makes debut\nIn pics: birds across China\nGrain painting studio helps villagers to increase income\nRelated Stories\nExhibition highlighting art-science integration opens in BeijingChina’s AI industry poised to enter boom timesAlibaba outlines 10 technology trends for 2019China’s leading AI enterprise iFlytek to develop health information technologyCheetah Mobile wades into artificial intelligenceChina overtakes the US in investment in AI5 million artificial intelligence talents urgently needed in ChinaGraduate students give 'voice' to sign languageArtificial Intelligence in real livesMicrosoft embraces artificial intelligence\nAbout People's Daily Online | Join Us | Contact Us\nCopyright © 2021 People's Daily Online. All Rights Reserved.\n\t\n",en.people.cn,Artificial intelligence improves parking efficiency in Chinese cities,"Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion. As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent. With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away. “This road used to be full of cars, and even the normal lanes were occupied. You could hardly move a bit during the morning and evening commute time,” recalled a citizen surnamed Wang, who lives in Chaoyang district of Beijing. “Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said. The smart roadside ETC system “AIpark Sky Eye” adopted by Beijing is developed, operated, and maintained by AIpark, a Beijing-based leading smart parking solution provider. The company’s intelligent system has brought into full play the advantages of AI technologies and effectively addressed the shortage of parking spaces and the problem of irregular parking in cities. The system has therefore been listed among the country’s innovation projects that integrate AI deeply into the real economy in 2018 by China’s Ministry of Industry and Information Technology (MIIT). Traditional parking management equipment and monitoring devices have failed to meet the actual needs of cities due to limited application scenarios and technical capacity. There are many deficiencies in traditional parking systems. For example, magnetic devices cannot identify detailed information about vehicles; each video monitoring pile can only cover one parking spot; and manual collection of parking fees costs too much. Such problems don’t exist in smart machines. The “AIpark Sky Eye” system boasts strong stability and high recognition rate. Besides, it can resist the interference of extreme weather conditions like rain, snow, and fog, and form complete graphic evidence based on wheel path of vehicles. Each set of cameras of the “AIpark Sky Eye” system can monitor multiple parking spots at the same time for 24 hours a day. The data collected by front-end cameras are processed using multi-dimensional deep learning algorithm before they are uploaded on to an AI computing cloud platform for data enrichment. The platform then distributes identification results to transport authorities. The most distinctive innovation in the technological package of the system is precision brought about by high-mounted parking system cameras, according to Xiang Yanping, senior vice president of AIpark, noting that the cameras can recognize more complex static and dynamic reality scenes. “For example, the equipment can accurately identify irregular parking behaviors and state such as double parking and frequent maneuvers, precisely recognize detailed information including plate number and vehicle color, and make good judgment on the behaviors of drivers and pedestrians,” Xiang said. Once the high-mounted parking system cameras are installed, they can help with many aspects of integrated urban governance, which represents another advantage of the “AIpark Sky Eye” system. Besides managing parking fee collection, high-mounted camera system can also provide data for traffic improvements. The snapshots obtained from the camera system can help solve problems including illegal and inappropriate parking and vehicle theft. So far, the smart ETC system of AIpark has been introduced into more than 20 cities in China, signaling increasingly important roles of AI in improving parking efficiency and order as well as new development opportunities for smart parking industry. Exhibition highlighting art-science integration opens in BeijingChina’s AI industry poised to enter boom timesAlibaba outlines 10 technology trends for 2019China’s leading AI enterprise iFlytek to develop health information technologyCheetah Mobile wades into artificial intelligenceChina overtakes the US in investment in AI5 million artificial intelligence talents urgently needed in ChinaGraduate students give 'voice' to sign languageArtificial Intelligence in real livesMicrosoft embraces artificial intelligence"


In [138]:
df_filtered.shape

(165530, 6)

## Tokenization, Lemmatization, and Concatnation

In [139]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

In [140]:
# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english')) 
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to C:\Users\Eason
[nltk_data]     Peng\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Eason
[nltk_data]     Peng\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Eason
[nltk_data]     Peng\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [141]:
def clean_tokens(text):
    
    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove single-character tokens (mostly punctuation)
    tokens = [word for word in tokens if len(word) > 1]

    # Do not remove numbers --> numbers here may matter
    # tokens = [word for word in tokens if not word.isnumeric()]

    # Remove punctuation
    tokens = [word for word in tokens if word.isalpha()]

    # Lowercase all words (default_stopwords are lowercase too)
    tokens = [word.lower() for word in tokens]
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    lemmatized_tokens = [lemmatizer.lemmatize(token.lower()) for token in tokens]
    return " ".join(lemmatized_tokens)

In [142]:
%%time
df_filtered['clean_token'] = df_filtered['clean_text'].apply(clean_tokens)

CPU times: total: 4min 44s
Wall time: 9min 48s


## Save to Local

In [145]:
df_filtered.columns

Index(['date', 'title', 'text', 'domain', 'clean_title', 'clean_text',
       'clean_token'],
      dtype='object')

In [146]:
df_filtered = df_filtered.drop(['text', 'title'], axis=1)

In [147]:
df_filtered.head(1)

Unnamed: 0,date,domain,clean_title,clean_text,clean_token
0,2021-03-18,en.people.cn,Artificial intelligence improves parking efficiency in Chinese cities,"Thanks to the application of an artificial intelligence (AI)-empowered roadside electronic toll collection (ETC) system, China’s capital city Beijing has seen significant improvement in the efficiency of parking fee collection, turnover of roadside parking spots, order in roadside parking, as well as traffic congestion. As the city further deepens its roadside parking reform, the ETC system has almost covered all the roadside parking spaces in the city, with the proportion of vehicles parked on roads using the system exceeding 90 percent. With the AI-empowered system, drivers can park their vehicles at the parking spots on the roadside, and then pay the parking charge via their mobile phones after they drive away. “This road used to be full of cars, and even the normal lanes were occupied. You could hardly move a bit during the morning and evening commute time,” recalled a citizen surnamed Wang, who lives in Chaoyang district of Beijing. “Since the summer of 2019, roadside ETC devices have been installed here. With all the cars being parked in designated parking spots on the roadside, the road now seems brighter and wider,” Wang said. The smart roadside ETC system “AIpark Sky Eye” adopted by Beijing is developed, operated, and maintained by AIpark, a Beijing-based leading smart parking solution provider. The company’s intelligent system has brought into full play the advantages of AI technologies and effectively addressed the shortage of parking spaces and the problem of irregular parking in cities. The system has therefore been listed among the country’s innovation projects that integrate AI deeply into the real economy in 2018 by China’s Ministry of Industry and Information Technology (MIIT). Traditional parking management equipment and monitoring devices have failed to meet the actual needs of cities due to limited application scenarios and technical capacity. There are many deficiencies in traditional parking systems. For example, magnetic devices cannot identify detailed information about vehicles; each video monitoring pile can only cover one parking spot; and manual collection of parking fees costs too much. Such problems don’t exist in smart machines. The “AIpark Sky Eye” system boasts strong stability and high recognition rate. Besides, it can resist the interference of extreme weather conditions like rain, snow, and fog, and form complete graphic evidence based on wheel path of vehicles. Each set of cameras of the “AIpark Sky Eye” system can monitor multiple parking spots at the same time for 24 hours a day. The data collected by front-end cameras are processed using multi-dimensional deep learning algorithm before they are uploaded on to an AI computing cloud platform for data enrichment. The platform then distributes identification results to transport authorities. The most distinctive innovation in the technological package of the system is precision brought about by high-mounted parking system cameras, according to Xiang Yanping, senior vice president of AIpark, noting that the cameras can recognize more complex static and dynamic reality scenes. “For example, the equipment can accurately identify irregular parking behaviors and state such as double parking and frequent maneuvers, precisely recognize detailed information including plate number and vehicle color, and make good judgment on the behaviors of drivers and pedestrians,” Xiang said. Once the high-mounted parking system cameras are installed, they can help with many aspects of integrated urban governance, which represents another advantage of the “AIpark Sky Eye” system. Besides managing parking fee collection, high-mounted camera system can also provide data for traffic improvements. The snapshots obtained from the camera system can help solve problems including illegal and inappropriate parking and vehicle theft. So far, the smart ETC system of AIpark has been introduced into more than 20 cities in China, signaling increasingly important roles of AI in improving parking efficiency and order as well as new development opportunities for smart parking industry. Exhibition highlighting art-science integration opens in BeijingChina’s AI industry poised to enter boom timesAlibaba outlines 10 technology trends for 2019China’s leading AI enterprise iFlytek to develop health information technologyCheetah Mobile wades into artificial intelligenceChina overtakes the US in investment in AI5 million artificial intelligence talents urgently needed in ChinaGraduate students give 'voice' to sign languageArtificial Intelligence in real livesMicrosoft embraces artificial intelligence",thanks application artificial intelligence ai roadside electronic toll collection etc system china capital city beijing seen significant improvement efficiency parking fee collection turnover roadside parking spot order roadside parking well traffic congestion city deepens roadside parking reform etc system almost covered roadside parking space city proportion vehicle parked road using system exceeding percent system driver park vehicle parking spot roadside pay parking charge via mobile phone drive away road used full car even normal lane occupied could hardly move bit morning evening commute time recalled citizen surnamed wang life chaoyang district beijing since summer roadside etc device installed car parked designated parking spot roadside road seems brighter wider wang said smart roadside etc system aipark sky eye adopted beijing developed operated maintained aipark leading smart parking solution provider company intelligent system brought full play advantage ai technology effectively addressed shortage parking space problem irregular parking city system therefore listed among country innovation project integrate ai deeply real economy china ministry industry information technology miit traditional parking management equipment monitoring device failed meet actual need city due limited application scenario technical capacity many deficiency traditional parking system example magnetic device identify detailed information vehicle video monitoring pile cover one parking spot manual collection parking fee cost much problem exist smart machine aipark sky eye system boast strong stability high recognition rate besides resist interference extreme weather condition like rain snow fog form complete graphic evidence based wheel path vehicle set camera aipark sky eye system monitor multiple parking spot time hour day data collected camera processed using deep learning algorithm uploaded ai computing cloud platform data enrichment platform distributes identification result transport authority distinctive innovation technological package system precision brought parking system camera according xiang yanping senior vice president aipark noting camera recognize complex static dynamic reality scene example equipment accurately identify irregular parking behavior state double parking frequent maneuver precisely recognize detailed information including plate number vehicle color make good judgment behavior driver pedestrian xiang said parking system camera installed help many aspect integrated urban governance represents another advantage aipark sky eye system besides managing parking fee collection camera system also provide data traffic improvement snapshot obtained camera system help solve problem including illegal inappropriate parking vehicle theft far smart etc system aipark introduced city china signaling increasingly important role ai improving parking efficiency order well new development opportunity smart parking industry exhibition highlighting integration open beijingchina ai industry poised enter boom timesalibaba outline technology trend leading ai enterprise iflytek develop health information technologycheetah mobile wade artificial intelligencechina overtakes u investment million artificial intelligence talent urgently needed chinagraduate student give sign languageartificial intelligence real livesmicrosoft embrace artificial intelligence


In [148]:
df_filtered.shape

(165530, 5)

In [149]:
%%time
# save filtered df
df_filtered.to_parquet('filtered_data.parquet', engine='pyarrow')

CPU times: total: 2.86 s
Wall time: 6.44 s
