#### Filtering Out Fashion-Related Content with the Fashion Corpus

##### Load the Fashion Corpus

In [25]:
import pandas as pd
import json
import jieba

# Load the fashion keywords from the CSV file
fashion_df = pd.read_csv('/home/disk1/red_disk1/fashion/topics0611_filtered.csv')

# Assuming the fashion-related keywords are in a column named 'keyword group'
fashion_keywords = fashion_df['keyword group'].dropna().astype(str).tolist()

# Print the first few keywords to verify
print(fashion_keywords[:5])


['显瘦', '韩系', '氛围感', '温柔', '小个子']
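
Because every keyword is later used in a substring test, stray whitespace, duplicates, or one-character entries in the list could flag many unrelated posts. Whether the CSV actually contains such entries is not known from the output above. The following is a minimal sketch of a cleaning pass; `clean_keywords` and the `min_length=2` threshold are assumptions for illustration, not part of the original notebook.

```python
def clean_keywords(raw_keywords, min_length=2):
    """Strip whitespace, drop very short entries, and de-duplicate (order preserved)."""
    seen = set()
    cleaned = []
    for kw in raw_keywords:
        kw = str(kw).strip()
        # Skip empty or very short keywords, and ones already kept
        if len(kw) < min_length or kw in seen:
            continue
        seen.add(kw)
        cleaned.append(kw)
    return cleaned


# Hypothetical sample input, not taken from the real CSV
sample = ['显瘦', ' 韩系 ', '显瘦', '的', '', '氛围感']
cleaned = clean_keywords(sample)
print(cleaned)
```

In the notebook this could be applied as `fashion_keywords = clean_keywords(fashion_keywords)` before any filtering.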


##### Load the Weibo2021 Dataset

In [28]:
weibo_data = []
with open('/home/disk1/red_disk1/fashion/Weibo2021-dataset/all_weibo.json', 'r', encoding='utf-8') as file:
    for line in file:
        try:
            weibo_data.append(json.loads(line))
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON on line: {line}")
            print(f"Error: {e}")

# Print a sample post to inspect the structure
print(weibo_data[0])
print(f"Total posts loaded: {len(weibo_data)}")


{'url': 'https://weibo.com/1843324043/HBE39oBV3?type=comment#_rnd1613749635594', 'uid': '1843324043', 'original_text': '网易云福利 卸载重装 送3个月黑胶会员 太爽了吧 （有个故事告诉我们：1.不要包着占便宜的心里去生活。2.音乐是你生活不可缺失的东西，不免费，请尊重[抱拳]） http://t.cn/RghOMBP', 'u_name': '三十隋', 'pic_url': 'https://wx4.sinaimg.cn/mw690/6ddee48bgy1g4ltl94v1rj20u00fu0uz.jp', 'label': 'false-rumor'}
Total posts loaded: 1018
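
The loading loop above prints the whole offending line on a decode error. A slightly more compact variant, sketched below under the assumption that the file is JSON Lines (one object per line), skips blank lines and records only the line numbers that fail to parse. `load_jsonl` is a hypothetical helper name.

```python
import json


def load_jsonl(lines):
    """Parse an iterable of JSON Lines; return (records, bad_line_numbers)."""
    records, bad_lines = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not treated as errors
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line_no)
    return records, bad_lines


# Small in-memory demo; the third line is deliberately malformed
demo_lines = ['{"uid": "1"}', '', '{"uid": ', '{"uid": "2"}']
records, bad_lines = load_jsonl(demo_lines)
print(len(records), bad_lines)
```

An open file object can be passed directly, since iterating over it yields lines.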


##### Filter Out Fashion-Related Content

In [32]:
# Function to segment the text using jieba
def segment_text(text):
    return ' '.join(jieba.cut(text))

# Keep only posts whose segmented 'original_text' contains none of the fashion keywords.
# Note: `keyword in segmented_text` is a substring test on the space-joined tokens.
non_fashion_posts = []
for post in weibo_data:
    if 'original_text' in post and post['original_text'].strip():
        segmented_text = segment_text(post['original_text'])
        if not any(keyword in segmented_text for keyword in fashion_keywords):
            non_fashion_posts.append(post['original_text'])

# Save the filtered non-fashion posts to a new file
with open('/home/disk1/red_disk1/fashion/non_fashion_texts.json', 'w', encoding='utf-8') as outfile:
    json.dump(non_fashion_posts, outfile, ensure_ascii=False, indent=4)

print(f"Saved {len(non_fashion_posts)} non-fashion posts.")

Saved 12 non-fashion posts.
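
Only 12 of the 1018 loaded posts survive this filter, which suggests the substring test may be matching more than intended. One alternative is to compare whole tokens against a keyword set. The sketch below assumes exact token matching is the desired behavior; `contains_keyword`, `filter_non_fashion`, and the toy whitespace tokenizer are hypothetical and exist only to keep the example self-contained.

```python
from typing import Callable, Iterable, List, Set


def contains_keyword(text: str, keywords: Set[str],
                     tokenize: Callable[[str], Iterable[str]]) -> bool:
    """Return True if any token of `text` is exactly one of `keywords`."""
    tokens = {tok.strip() for tok in tokenize(text)}
    return not keywords.isdisjoint(tokens)


def filter_non_fashion(texts: Iterable[str], keywords: Iterable[str],
                       tokenize: Callable[[str], Iterable[str]]) -> List[str]:
    """Keep non-empty texts that share no token with the keyword list."""
    keyword_set = {k.strip() for k in keywords if k.strip()}
    return [t for t in texts
            if t.strip() and not contains_keyword(t, keyword_set, tokenize)]


# Toy tokenizer for the demo: the sample texts are already space-separated
toy_tokenize = str.split
demo_texts = ['这件 裙子 很 显瘦', '今天 天气 不错', '   ']
kept = filter_non_fashion(demo_texts, ['显瘦', '韩系'], toy_tokenize)
print(kept)
```

In the notebook, `jieba.lcut` could be passed as `tokenize`. A keyword that jieba splits into several tokens would never match under this scheme, so registering each keyword with `jieba.add_word` beforehand may be needed.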


In [31]:
# List posts whose raw (unsegmented) text contains any fashion keyword as a substring
for post in weibo_data:
    text = post.get('original_text', '')
    if any(keyword in text for keyword in fashion_keywords):
        print(f"Fashion-related post: {text}")


Fashion-related post: 网易云福利 卸载重装 送3个月黑胶会员 太爽了吧 （有个故事告诉我们：1.不要包着占便宜的心里去生活。2.音乐是你生活不可缺失的东西，不免费，请尊重[抱拳]） http://t.cn/RghOMBP
Fashion-related post: 网易云卸载重装后就🉑️自动领取三个月黑胶vip嗷
Fashion-related post: 洗手酒～[吃瓜][吃瓜][二哈][二哈] http://t.cn/Ai0XhOP4
Fashion-related post: 据说iPhone用户现在卸载网易云音乐，再重新安装就可以享受网易云音乐的三个月会员。 这是个bug，建议大家快去试一下。 http://t.cn/EwxucNQ
Fashion-related post: 据说iPhone用户现在卸载网易云音乐，再重新安装就可以享受网易云音乐的三个月会员。 这是个bug，建议大家快去试一下。 http://t.cn/ReAdziD
Fashion-related post: 据说iPhone用户现在卸载网易云音乐，再重新安装就可以享受网易云音乐的三个月会员。 这是个bug，建议大家快去试一下。 http://t.cn/RkM2Zc8
Fashion-related post: 听说网易云卸载重装后就能获得网易云会员，快来试试吧[doge]
Fashion-related post: 网易云卸载再安装能获得免费会员！！
Fashion-related post: 【逼真】日本女性机器人上市，遭到疯抢。完全按照真人比例打造的，除了不能生孩子以外，基本具备女性一切功能，功能强悍，收拾家务，洗衣服、煮饭、倒垃圾等等，你交代过的都会完成，不会偷懒。平时你下班了，她还会主动迎接你，帮你端茶倒水，主动帮你提东西，到了傍晚的时候，你还可以带她出去走走。
Fashion-related post: 都来看看那些非洲国家黑人的狼子野心，还特么有女人愿意献身给他们吗？[拜拜]
Fashion-related post: #河南玛莎拉蒂#玛莎拉蒂女车主醉驾至两人死亡，现在有一种说法，就是女主谭某某有间歇性精神病[米奇比心][吃惊]真心不希望这女的减刑或者轻判 #醉驾玛莎拉蒂逃逸路线##女子醉驾玛莎拉蒂致两死# ​​​
Fashion-related pos
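
Several of the posts flagged above (for example the NetEase Cloud Music ones) have no obvious fashion content, so it may help to see which keywords are responsible for the matches. The following sketch counts, per keyword, how many texts contain it as a substring. `keyword_hit_counts` is a hypothetical helper, and the demo keywords are made up rather than taken from the corpus.

```python
from collections import Counter


def keyword_hit_counts(texts, keywords):
    """Count, for each keyword, the number of texts that contain it as a substring."""
    unique_keywords = {kw for kw in keywords if kw}
    counts = Counter()
    for text in texts:
        for kw in unique_keywords:
            if kw in text:
                counts[kw] += 1
    return counts


# Hypothetical demo data; '卸载' is not claimed to be in the real keyword list
demo_texts = ['网易云卸载重装', '显瘦的裙子', '卸载再安装']
demo_keywords = ['显瘦', '卸载', '韩系']
hits = keyword_hit_counts(demo_texts, demo_keywords)
print(hits.most_common(2))
```

Running this on `[p['original_text'] for p in weibo_data]` and `fashion_keywords`, then inspecting `hits.most_common(20)`, would show whether a handful of very general keywords account for most of the matches.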

##### Verify the Filtered Corpus

In [30]:
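# Check a few non-fashion posts.
# non_fashion_posts holds plain strings (the 'original_text' values), not dicts,
# so print each item directly; indexing with post['text'] raises
# "TypeError: string indices must be integers".
for text in non_fashion_posts[:5]:
    print(text)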